


BULLETIN OF THE SCHOOL OF 
EDUCATION, INDIANA UNIVERSITY 





Entered as second-class matter September 30, 1924, at the post-office at
Bloomington, Indiana, under the act of August 24, 1912. Published six times a year from
the University Office, Bloomington, Indiana. 





Vol. I BLOOMINGTON, IND. 


Eleventh 
Conference on Educational 


Measurements 


JANUARY, 1925 














For sale by the University Bookstore, Bloomington, Ind. 


Price, 50 cents. 


A limited number of copies of this bulletin will be distributed free to 
citizens of Indiana. 

















ELEVENTH ANNUAL CONFERENCE 


ON 


EDUCATIONAL 


MEASUREMENTS 


Held at Indiana University, Bloomington, Ind., Friday 
and Saturday, April 18 and 19, 1924 


PUBLISHED BY
THE SCHOOL OF EDUCATION OF INDIANA UNIVERSITY 
1925 











Contents 


AN EXPERIMENT IN INTERFERENCE AND LEARNING IN GIVING THE STANFORD
ACHIEVEMENT TESTS. By Carl G. F. Franzén, Associate Professor of
Secondary Education, and Herman H. Young, Associate Professor of
Psychology, Indiana University

SUGGESTIONS ON HOW TO KEEP USABLE PERMANENT RECORDS OF MENTAL
AND ACHIEVEMENT TESTS. By Herman H. Young

MEASURING THE BUDGETARY PROCEDURE OF A SCHOOL SYSTEM. By Harold
F. Clark, Assistant Professor of Education, Indiana University

IMPROVING COMPREHENSION ABILITY IN SILENT READING. By Grover H.
Alderman, Professor of Elementary Education, Indiana University

THE ADVANTAGES OF ABILITY GROUPING. By Clifford Woody, Director of
Bureau of Educational Reference and Research, University of Michigan

RESULTS FROM SUCCESSIVE REPETITIONS OF CERTAIN ARITHMETIC TESTS. By
Clifford Woody

THE PERMANENT INFLUENCE OF THE TEACHING OF SPELLING. By Clifford
Woody

INDIVIDUAL DEVELOPMENT AS SHOWN BY REPEATED MEASUREMENTS. By
Walter F. Dearborn, Psycho-Educational Clinic, Graduate School of
Education, Harvard University

RELIABILITY AND USES OF GROUP TESTS OF INTELLIGENCE. By Walter F.
Dearborn

SPECIAL DISABILITY IN LEARNING TO READ. By Walter F. Dearborn







An Experiment in Interference and Learning 
in Giving the Stanford Achievement Tests 


CARL G. F. FRANZEN, Associate Professor of Secondary Education, and 
HERMAN H. YOUNG, Associate Professor of Psychology, 
Indiana University 


THE origin of this experiment dates back to the county unit testing 
that was carried on in October, 1923, in LaGrange, Johnson, Whitley, 
and Rush counties. Under the direction of Mr. Frank L. Shaw the 
Stanford Achievement Test and the Haggerty No. 1 Silent Reading 
Test were adapted for this particular purpose. Since there were so 
many one-room rural schools to be tested, and since a maximum time 
of three hours was all that could be devoted to the giving of the tests, 
it was necessary to devise some scheme of administration which would 
enable the giver of the tests to get thru within the allotted time. This 
was made particularly necessary since the Stanford Achievement Test 
was to be given to all pupils from the third grade thru the eighth grade 
at the same time, a situation for which the original tests did not make 
provision, and since the Haggerty Test was to be given to pupils of 
Grades 2 and 3. 

The procedure outlined was as follows: After the Stanford
Achievement booklets had been distributed, the instructions read, and
the information blanks filled out, all the pupils in the third grade and
up were given Test No. 9, the dictation spelling test, which required
about thirty-five minutes. At the conclusion of the dictation test all of 
the pupils were started on Test No. 1. At the end of fifteen minutes 
the third grade was told to stop. After an interval of one minute they 
were given instructions and began Test No. 2. In the meantime, the 
five upper grades had completed Test No. 1 in twenty minutes and were 
waiting until Grade 3 had finished Test No. 2. After an intermission 
of five minutes all six grades were called together and Grade 3 was 
started out on Test No. 3. While they were busy on it the upper group 
received instruction for Test No. 2 and began working on it. In this 
way both groups were kept busy on different tests until all the tests 
were completed. 

The table below shows the sequence in which the tests were given, 
and how it was possible to give the Haggerty Tests to the second and 
third grades while the five upper grades were finishing the Stanford
Test:

TABLE I. ORDER OF TESTS IN GROUPS WHICH CONTAIN GRADE 3

Grades 4-8                  Minutes    Grade 3                     Minutes
Preparation (Stanford)         5x      Preparation (Stanford)         5x
Dictation (Stan. 9)           35x      Dictation (Stan. 9)           35x
Paragraph (Stan. 1)           20x      Paragraph (Stan. 1)           15x
                                       (Interval of 1 minute)
                                       Sentence (Stan. 2)             5
Total time                    61       Total time                    61
Intermission                   5       Intermission                   5
Sentence (Stan. 2)            10       [illegible]
Word (Stan. 3)                10       Word (Stan. 3)                 5
Computation (Stan. 4)         20       Computation (Stan. 4)         10
                                       Reasoning (Stan. 5)           10
                                       [illegible]                   15
Total time                    43       Total time                    43
Intermission                   5       Intermission                   5
[illegible]
Reasoning (Stan. 5)           20       Preparation (Haggerty)         6
Language (Stan. 8)             8       [illegible]                    2
                                       [illegible]                   20
Total time                    30       Haggerty test also given to Grade 2
Grand total time             144       Grand total time             144

x = to be given together.
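The block totals in Table I can be checked against the three-hour limit mentioned earlier. The Python sketch below is only an illustration and not part of the original procedure. It assumes the three block totals (61, 43, and 30 minutes) and the two five-minute intermissions as printed in the table; the constant and function names are hypothetical.

```python
# Block lengths in minutes as printed in Table I; one 5-minute
# intermission falls between each pair of consecutive blocks.
BLOCK_MINUTES = [61, 43, 30]
INTERMISSION_MINUTES = 5
TIME_LIMIT_MINUTES = 3 * 60  # "a maximum time of three hours"


def grand_total(blocks, intermission):
    """Total sitting time: all blocks plus one intermission between each pair."""
    return sum(blocks) + intermission * (len(blocks) - 1)


total = grand_total(BLOCK_MINUTES, INTERMISSION_MINUTES)
slack = TIME_LIMIT_MINUTES - total  # minutes left over within the limit
print(total, slack)
```

Under these assumptions the sum agrees with the printed grand total of 144 minutes and leaves a margin inside the three hours.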

Such was the original plan. However, when the directors for the 
various counties met on the Saturday preceding the week in which the 
tests were to be given, so much objection was raised on the part of 
those who were to give these tests that Dean Smith and Mr. Shaw were 
finally prevailed upon to arrange a schedule of such a nature that there 
would be no possibility of interference—that is, any chance for a group 
like that of the third grade to be given directions for a test, while 
another test was being performed by the five upper grades. The objec- 
tions were of two types: first, that this apparently direct interference 
would prevent effective work on the part of the pupils; second, that it 
would be difficult for the examiner to shift so rapidly from one test to 
another and be able to keep accurate record of the time involved. As 
has been stated, these objections were at the time sufficient to cause a 
change in the program, and it was according to this changed program 
that the tests were given in October, 1923, and also in April, 1924.

At the close of the first testing, Mr. Shaw raised the question about
the advisability of using the same test that had been used the first 
year in the same form for testing the children the second year, and 
suggested an experiment be undertaken by the University to help decide 
this question. At the same time Dean Smith made the suggestion of 
trying to so organize the experiment that data could be collected upon 
the relative effect of the element of interference in the application of 
the tests. It was with these two objectives in mind that this experiment 
was undertaken. 

Thru the kindness and courtesy of Superintendent William H. Jones 
of Monroe County, permission was received to select eight schools in 
the county for the purpose of the experiment. A preliminary investiga- 
tion to determine just which schools would best be suited to the pur- 
poses of this investigation led to the selection of four one-room rural 
schools: the Dolan, the Payne, the Waterworks, and the Headley; and 
four in which the third and fourth grades predominated in one room: 
Richland, Smithville, Sanders, and Clear Creek. Dr. H. F. Clark and 
Mr. Dale Russell, of the School of Education, assisted in the adminis- 
tering of the tests. Both of them had taken part in the county unit 
testing program, so that no one was connected with this experiment 
who had not had previous training on these very tests. 

Four series of tests were administered in the months of November, 
December, February, and March. An average interval of four months 
separated the first testing from the fourth. Testing was given in the 
following order: 





                                                  Method            Method
   Tests                                          Group A*          Group B*

1. The Stanford Achievement Tests and             with              without
   Haggerty Silent Reading Test                   interference      interference

2. The Woody-McCall Mixed Fundamentals,
   Form 1, and the Kelley-Trabue                  without           with
   Completion Exercise, Beta                      interference      interference

3. The Stanford Achievement Tests and             without           without
   Haggerty Silent Reading Test                   interference      interference

4. The Woody-McCall Mixed Fundamentals,
   Form 1, and the Kelley-Trabue                  without           without
   Completion Exercise, Beta                      interference      interference


*Group A included the Dolan, Headley, Smithville, and Richland schools.
Group B included the Payne, Waterworks, Sanders, and Clear Creek schools.


It was decided not to give the spelling test in connection with the
Stanford Achievement Test for two reasons: first, because it consumed
so much time, approximately forty minutes; and second, because it
would not contain the factor of interference.

The purpose of putting in the Woody-McCall and the Kelley-Trabue 
Tests was primarily to measure the amount of improvement, because in 
the Stanford Test both of these types appear. However, in the final 
results these tests were used chiefly to discover the effect of interference.

The next step was to arrange for the factor of interference. Four
schools were selected for the first testing, to which was applied the 
original program, and four others the modified program, in which there 
was no interference. In the first group of schools the three succeeding 
tests were all given without any interference, whereas in the second 
group interference took place in the second testing series. 

After the tests had been given, the results were tabulated by the 
Bureau of Coöperative Research. So many unforeseen circumstances
had arisen during the time in which the tests were administered that 
it was found impracticable to use satisfactorily results other than those 
of the third and fourth grades. These grades were the only ones which 
had a sufficiently large number of pupils all the way thru the experi- 
ment to permit of reasonably significant statistical treatment. The orig- 
inal intention had been to analyze the results according to the scheme 
that had been proposed by Mr. Shaw. Its main elements were to assem- 
ble all pupils of the same grade and then to group them by two criteria: 
first, age; and second, similar scores. This procedure, of course, would 
have been used for the first series. Upon the repetition of the test, the 
scores made by the various individuals would be set off against the 
original form of grouping, and any changes in scores due to inter- 
ference or to increased familiarity with the test would then be evident. 
The small numbers in the upper grades made it impossible to adopt this 
scheme because of the circumstances previously mentioned. Conse- 
quently, we were practically forced to adopt a group method of com- 
parison, that is, to find the average scores made by the third and fourth 
grade pupils who took all four tests, and then draw such conclusions as 
seemed to follow. 

Table II gives the summary of the results for the third and fourth 
grades. An examination of this table does not reveal much one way or 
another. Consequently, it has been condensed, in order, if possible, to 
reveal any outstanding characteristics. In the condensed table, Table 
III, page 8, the third and fourth grades have been separated, but in 
each instance the following facts have been noted: 


(a) That group which made the largest initial score, represented by X. 

(b) That group which made the greatest gain the second time the test 
was taken, represented by +. 

(c) Whether or not the group took the test with interference, repre-
sented by I.

TABLE II. SUMMARY OF RESULTS FOR THE THIRD AND FOURTH GRADES

                         NUMBER     INITIAL    FINAL               PER CENT
THIRD GRADE      GROUP   OF CASES   SCORE      SCORE      GAIN     OF GAIN
Stanford           A        23       95.69     125.52     29.83       31
                   B        13       84.83     118.00     33.17       39
Kelley-Trabue      A        23        4.08       4.70       .62       15
                   B        13        3.36       4.18       .82       24
Woody-McCall       A        23        8.43      10.87      2.44      [29]
                   B        13        8.84      10.38      1.54       17
Haggerty           A        23       20.61      24.35      3.74       18
                   B        13       19.[00]    24.38      5.38       28

FOURTH GRADE
Stanford           A        26      129.65     156.69     27.04       21
                   B        13      173.54     203.31     29.77       17
Kelley-Trabue      A        26        4.46       4.64       .18        4
                   B     [illegible]
Woody-McCall       A        26       11.19      12.42      1.23      [11]
                   B        13       11.85      15.31      3.46       29











Group “A” refers to the four schools who took the Stanford Achievement
and Haggerty Silent Reading Tests with interference the first time
they were given.

Group “B” refers to the four schools who took these same tests the
first time without interference.
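The Gain and Per Cent of Gain columns of Table II appear to follow directly from the initial and final scores. The printed figures are consistent with gain = final − initial and per cent of gain = 100 × gain ÷ initial, rounded to the nearest whole number. The Python sketch below works under that assumption, which the text does not state outright; the function name is hypothetical. It uses the third-grade Stanford rows.

```python
def gain_and_percent(initial, final):
    """Return (gain, per cent of gain) for one row of the table.

    Assumes the per cent of gain is taken relative to the initial score.
    """
    gain = round(final - initial, 2)
    percent = round(100 * gain / initial)
    return gain, percent


# Third-grade Stanford rows of Table II: (initial score, final score).
stanford_third_a = gain_and_percent(95.69, 125.52)
stanford_third_b = gain_and_percent(84.83, 118.00)
print(stanford_third_a, stanford_third_b)
```

Both rows reproduce the printed gain and per cent of gain, which supports reading the last column as gain relative to the initial score.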

It will be noted that the Kelley-Trabue and the Woody-McCall tests 
were given as a second and fourth series.* The first time that they 
were given they were presented to Group B pupils with interference and 
to the Group A pupils without interference. In this way both groups 
were subjected to interference altho thru different tests. It must also 
be remembered that interference was given only the first time any of 
the tests were administered. 


*In order to provide for interference in the giving of the Kelley-Trabue and the 
Woody-McCall Tests, special directions were devised, which are here given: 


The following program will be followed in the administration of the Woody-McCall 
Mixed Fundamentals and the Kelley-Trabue Completion Test when the two tests are 
given without interference: 

Begin with the Woody-McCall test and allow all from the third grade up ten min- 
utes. At the completion of this test and the collection of the papers, distribute the 
Kelley-Trabue blanks; and after the instructions have been given and the test started 
allow the third grade ten minutes and the fourth grade and up fifteen minutes.


The following program will be followed for administering the Woody-McCall Mixed 
Fundamentals and the Kelley-Trabue Completion Test when there is interference: 

Start the third grade with the Woody-McCall Mixed Fundamentals. Allow ten min- 
utes for the test. Then turn to the fourth grade and up and distribute the Kelley-Trabue 
test. Give the directions, start the pupils, and allow fifteen minutes. When the third 
grade has finished the Woody-McCall and while the fourth grade and up are still
working on the Kelley-Trabue, begin the Kelley-Trabue with the third grade, and allow 
ten minutes. When the fourth grade and up have finished the Kelley-Trabue and while 
the third grade is still working on it, start the fourth grade and up with the Woody- 
McCall and allow ten minutes. 
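The two programs in the directions above can be laid out as timelines to show where interference arises, that is, where one group is started on a test while the other is still working. The Python sketch below is a rough illustration only. It assumes that distributing papers and giving directions take no time, which the directions do not say, and all names are hypothetical.

```python
from typing import NamedTuple


class Sitting(NamedTuple):
    group: str
    test: str
    start: int  # minutes from the start of the session
    end: int


def interference_points(sittings):
    """Return (minute, starting group, working group) for every start that
    falls while a different group is still in the middle of a test."""
    points = []
    for s in sittings:
        for other in sittings:
            if other.group != s.group and other.start < s.start < other.end:
                points.append((s.start, s.group, other.group))
    return sorted(points)


# Program with interference (zero set-up time assumed).
WITH_INTERFERENCE = [
    Sitting("Grade 3", "Woody-McCall", 0, 10),
    Sitting("Grade 4 and up", "Kelley-Trabue", 0, 15),
    Sitting("Grade 3", "Kelley-Trabue", 10, 20),
    Sitting("Grade 4 and up", "Woody-McCall", 15, 25),
]

# Program without interference: both groups start each test together.
WITHOUT_INTERFERENCE = [
    Sitting("Grade 3", "Woody-McCall", 0, 10),
    Sitting("Grade 4 and up", "Woody-McCall", 0, 10),
    Sitting("Grade 3", "Kelley-Trabue", 10, 20),
    Sitting("Grade 4 and up", "Kelley-Trabue", 10, 25),
]

print(interference_points(WITH_INTERFERENCE))
print(interference_points(WITHOUT_INTERFERENCE))
```

Under the zero set-up assumption, the interference program produces two mid-test starts, at the tenth and fifteenth minutes, and the other program produces none. Both programs occupy twenty-five minutes of working time.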

TABLE III

TESTS             THIRD GRADE        FOURTH GRADE
Stanford          1. A  X I          5. A  I
                     B  +               B  X +
Kelley-Trabue     2. A  X            6. A
                     B  + I             B  + I
Woody-McCall      3. A  +            7. A
                     B  X I             B  X + I
Haggerty          4. A  X I
                     B  +

[Placement of X in comparison 6 illegible.]










In order to interpret Table III the following assumption has been 
made: 


If interference was a negative factor in the giving of the test, the 
initial score would be such that the groups to which interference was 
given would consistently make a greater corresponding gain on the 
second test than the group in which there was no interference. If such 
a group made a smaller gain, it would mean that interference was posi- 
tive, or of no effect at all. It must further be borne in mind that the 
time in both the Woody-McCall and the Kelley-Trabue Tests was so 
shortened that no one had time to finish. This situation was the opposite 
of that found in the various parts of the Stanford Achievement and the 
Haggerty Tests. Examiners noted that in these two last-mentioned 
tests, in practically all cases, there was ample time for the pupils to 
idle away many minutes, watching what was going on in the room, or 
in hit-and-miss fashion scoring the remainder of the tests after they 
had once reached their limit. 

It will be noted that in Table III there are seven comparisons pos- 
sible between Groups A and B as a basis for determining the probable 
effect of interference. In three out of the seven comparisons, numbers 
2, 6, and 7, the greatest gain was made by the group with which inter- 
ference was employed. 
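The tally just described can be retraced from the gains in Table II. The Python sketch below is an illustration under stated assumptions rather than the authors' own computation. It takes "greatest gain" to mean the larger absolute gain, it numbers the comparisons as in Table III, and it omits comparison 6 because the Group B figures for the fourth-grade Kelley-Trabue test are not legible in this copy. The names are hypothetical.

```python
# (comparison number, test, grade, gain of Group A, gain of Group B,
#  group that had interference the first time the test was given)
COMPARISONS = [
    (1, "Stanford", 3, 29.83, 33.17, "A"),
    (2, "Kelley-Trabue", 3, 0.62, 0.82, "B"),
    (3, "Woody-McCall", 3, 2.44, 1.54, "B"),
    (4, "Haggerty", 3, 3.74, 5.38, "A"),
    (5, "Stanford", 4, 27.04, 29.77, "A"),
    (7, "Woody-McCall", 4, 1.23, 3.46, "B"),
]


def tally(comparisons):
    """Split comparison numbers by whether the group with the larger gain
    was the one that had taken the test with interference."""
    interfered, not_interfered = [], []
    for number, _test, _grade, gain_a, gain_b, interfered_group in comparisons:
        larger = "A" if gain_a > gain_b else "B"
        if larger == interfered_group:
            interfered.append(number)
        else:
            not_interfered.append(number)
    return interfered, not_interfered


with_interference, without_interference = tally(COMPARISONS)
print(with_interference)     # interference group made the greater gain
print(without_interference)  # non-interference group made the greater gain
```

For the six legible comparisons the split agrees with the text: numbers 2 and 7 fall to the interference group and numbers 1, 3, 4, and 5 to the other.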

It will also be noted that these three instances occur on the Kelley- 
Trabue and the Woody-McCall Tests. In four cases out of seven, num- 
bers 1, 3, 4, and 5, the group which did not have interference made the 
greatest gain whether or not they had the largest initial scores. In all 
four cases in which these comparisons were made in the Stanford and 
Haggerty Tests, it will be remembered from the statement made above 
that the pupils had ample opportunity to waste a lot of time. It does 
seem, then, that interference, so far as we have been able to determine, 
is a negative factor in rate tests, but that in power tests, where there 
is always a chance for a pupil to reach his limit before time is called, 
interference plays a negligible factor. Consequently, so far as Mr. 
Shaw’s original program and its application to the one-room rural 
schools is concerned, these results would indicate that it would be a 
legitimate procedure.

TABLE IV. ACTUAL SCORE COMPARED TO EXPECTED INCREASE IN SCORE WITH
REPETITION OF THE SAME TEST AFTER AN INTERVAL OF TWO MONTHS

                                     INCREASE IN SCORE
GRADE          TEST                EXPECTED    ACTUAL    ACTUAL ÷ EXPECTED
[illegible]    Stanford No. 1         2           8             4
[illegible]    Stanford No. 2         3.5        10             3
[illegible]    Stanford Total        11          29             2.6
[illegible]    Haggerty No. 1          .7         2.1           3
[illegible]    Haggerty No. 2          .7         2.2           3







Attention should be called to the fact that the scores of the test- 
ing given in this investigation in November are only about 70 per cent 
as high as the scores for the standard tests for September. 

In order to answer the question raised by Mr. Shaw as to the effect 
upon test results of giving exactly the same test the second time to the 
same group of children, the increase in the scores was determined from 
the first to the second test for the tests named in Table IV. From the 
standard tables furnished in the manuals for each of the tests the in- 
crease in score that might be expected in a two-months’ interval was 
calculated. Table IV gives this expected increase and the actual aver- 
age increase of all children of the schools tested. The actual increase 
divided by the expected increase shows that, on an average, the eight 
schools increased their scores three times as much as would be expected 
from the standards given in the manuals. This is in accord with results 
obtained by most other investigators on the effect of the repetition of 
the same tests in any given group.* 


*Dunlap, K., and Snyder, A. “Practice Effects in Intelligence Tests.” Journal
of Experimental Psychology, 1920, 3, 396-403. Thorndike, E. L. “Practice Effects in
Intelligence Tests.” Journal of Experimental Psychology, 1922, 5, 101-107.
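The ratio in the last column of Table IV, and the statement that scores rose about three times as much as expected, can be retraced with a few lines of arithmetic. The Python sketch below is only an illustration with hypothetical names. It reads the Haggerty expected increase as 0.7, the value consistent with the printed ratio of 3, and it takes the "average" as a simple unweighted mean of the five ratios, which the text does not specify.

```python
# (test, expected increase, actual increase) from Table IV.
TABLE_IV = [
    ("Stanford No. 1", 2.0, 8.0),
    ("Stanford No. 2", 3.5, 10.0),
    ("Stanford Total", 11.0, 29.0),
    ("Haggerty No. 1", 0.7, 2.1),   # expected value read as 0.7 (assumption)
    ("Haggerty No. 2", 0.7, 2.2),   # expected value read as 0.7 (assumption)
]

# Actual increase divided by expected increase, test by test.
ratios = {name: actual / expected for name, expected, actual in TABLE_IV}

# Unweighted mean of the ratios (an assumed reading of "on an average").
mean_ratio = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}")
print(f"mean: {mean_ratio:.1f}")
```

On this reading the mean ratio comes out a little above three, in line with the figure given in the text.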














Suggestions on How to Keep Usable Permanent 
Records of Mental and Achievement Tests 


HERMAN H. YOUNG, Associate Professor of Clinical Psychology, 
Indiana University 


IF a test is worth buying, giving, and scoring, its results are valu- 
able enough to be kept permanently. If the results are not worth 
keeping permanently, the test is not worth giving. Exact records of 
what each child does should be kept in a convenient permanent form 
and should be referred to whenever questions concerning an important 
disposition of a child arise. 

Averages of classes, schools, etc., are of interest and of value in 
comparing groups, but they do not solve the problem of the individual 
child within these groups. To reach successfully all children assigned 
her the teacher must adapt her instruction to the needs of the members 
of her group as individuals and must reckon with their variations from 
the average. She cannot look up the average of her group and then 
proceed to instruct that average—and expect to make her teaching a 
success. 

Permanent records of individual children are indispensable to good 
school organization and instruction. A standard system of permanent 
records for the entire school system is the only method productive of 
good results. A teacher in a large school can keep records of her chil- 
dren during their short stay in her room, and may collect some interest- 
ing data on the constitution of her classes, but she can do very little 
constructive work, except as the principal becomes interested and extends 
the system to the entire school. The principal, thru a properly directed 
uniform system of records, can make an excellent analysis of his school 
and could organize it very well if he could so control the patronage that 
the personnel of his school remained identical.

Unfortunately, even before he gets his organization outlined, his 
turnover of children begins. Out go some of the children whose records 
he has just completed, and in comes an equal, usually a larger, number 
upon whom no records have ever been kept. To place these new entrants 
in his school, he must either test them hurriedly or place them by guess. 
Hurried testing under such conditions is not very satisfactory. How- 
ever, unless some testing is done, the principal’s guesses will soon defeat 
his classification. Such experiences in school systems keeping no per- 
manent records cause teachers and principals to become discouraged 
and to drop uneasily in line with the uncertain progress of the school 
system of which they are exactly what they now appreciate themselves 
to be, small units of a larger, rambling educational program. 

Test results cannot be used efficiently unless uniform, permanent 
records of every test on every child of the entire school system are 
kept on cards provided especially for this purpose and follow the child
from the time he enters until he leaves school. This test record, together
with his medical and promotion cards, should follow him immediately 
thru properly organized official channels into every new school as he 
moves about within the same school system. If he moves into a dif- 
ferent city, this record should follow him providing his new school 
requests it. 

Such a system of records enables schools to be organized on a scien- 
tific basis which is otherwise impossible. In the first place, it gets away 
from disconnected testing spasms which at best only give cross-sections 
of the school population. They seldom go beyond the computation of 
averages and the labeling of a smaller or larger group of children by 
some indefinite and ambiguous term as subnormal or retarded. Of 
course if the school authorities take these spells at the appropriate 
season of the year, they may copy the records of the children on a few 
scraps of paper and indulge in a bit of school reorganization. Such 
procedure invariably throws an appreciably large percentage of chil- 
dren into groups where they do not belong. Teachers and principals 
taking note of these misplacements make wholesale criticism of tests. 

Criticism, like charity, should begin at home. Criticism should be 
directed at the unreasonable expectations and methods employed in their 
application and not at the tests themselves. It is ridiculous to presume 
that any group test could ever be given to a class under such favorable 
conditions that every child would reach the zenith of his possibilities. 
Some children may be ill, others may be worried about home or school 
conditions, etc., etc., and consequently work far below their general
average. 

As test papers and scores do not show these hindering factors, 
there is no way of deciding which children failed to do themselves 
justice on any particular test. Such is the danger and unfairness of 
classifying children on the basis of only one test. With a permanent 
record system and beginning with the second of any particular type of 
test, such as the second intelligence test, or the second arithmetic test, 
or the second silent reading test, children can be classified on the basis 
of tendencies which are revealed by their ratings on all tests of any 
special type. It is also possible to note improvement or its absence in 
any particular subject in which they have been given at least three 
tests. It should never be expected that only one group test will furnish 
an adequate basis of judgment and classification. In order to have 
enough test records on a child to give him justice, it is neither neces- 
sary nor wise to be testing him all the time. It is best to have records 
obtained at yearly or half-yearly intervals for each type of test for the 
entire period of school attendance. 

At this point attention must be called to the factors of time and 
clerical work necessary to keep a record system up to date. These two 
factors which furnish the big excuses for failures to keep records look 
like very flimsy excuses when subjected to analysis. The recording of 
results and keeping of records is a very simple matter compared with 
the money, time, and energy necessary to purchase, give, and score tests. 
Even if the keeping of records were difficult, it would be warranted 
because only thru permanent records can a harvest of valuable results 
be reaped from tests.

That many attempts to keep records have failed was the result of
at least two outstanding factors. First, the system of records was 
inadequate; and, second, the records after being kept were useless and 
without value because the tests from which they were obtained did not 
have proper standards and could not be readily interpreted and com- 
pared. The remedy for the second of these causes of failure is to pur- 
chase only those tests which are so standardized that their records have 
a definite meaning. The remedy for the first of these causes of failure 
is for each school system to reckon with its own facilities and needs 
and then adopt a uniform record system of its own construction, or
one used elsewhere, which will meet its individual requirements.

There are a number of conditions which every record system must 
fulfil if it is to be successful. While it is probably impossible for any 
one card to meet satisfactorily all individual local requirements in dif- 
ferent communities, it is possible to state some of the basic require- 
ments needed wherever records are to be kept. 


1. Size of Record. Permanent record blanks should be large enough 
to carry all significant information of every test taken by each child 
from the time he enters until he leaves school. They should also be 
small enough to be handled and filed easily. Cards 5x8 or 8½x11 inches
are obviously the most satisfactory compromise between convenience in 
size and the immense amount of data of which one would often like a 
record. 

2. Value of Data. Every item of information should have a spe- 
cific and definite meaning either in and of itself or in reference to 
other recorded items. One important criterion here should be whether 
or not an item furnishes a definite basis of comparison for one test 
with all others taken by the same person and of one person with all 
others who have taken the same test. Some items, like the score, the 
date of birth, and the date of examination, which do not have any 
special value in and of themselves, are the corner stones of calculation 
for other significant items and must be recorded because of their poten- 
tial value. 

3. Arrangement of Information. The most significant and most 
used data should be recorded in a systematic arrangement on that part 
of the blank which is most convenient and conspicuous. This enables 
one looking up a specific bit of information to locate it instantly on 
every card. 

To meet the above practical requirements the record card, illus- 
trated here with records of two children, was constructed to keep records 
of tests given the Bloomington school children. This is 5x8 inches in 
size with the items and spaces arranged in the order and proportions 
indicated. The cards are printed alike on both sides. The sex of each 
child is indicated by color of card. White cards are used for girls’ 
records and salmon-colored cards for boys’ records. In order to make 
the cards easy to file and to make them easily found after being filed, 
the child’s name, with his last name coming first, is written in the upper 
left-hand corner. Race and nationality are not so important, but are 
often valuable, so they are recorded in the center of the card on the 
same line with the name. If for any reason the child has been studied
specially and considerable additional information on him has been
collected and recorded elsewhere, the case number in the upper right-hand
corner of the card will show this fact and tell where to find the infor- 
mation. Immediately below the name two lines (often none too many) 
have been left to record the child’s initial address and to indicate its 
changes during his school life. The first names of the father and mother 
are often an important means of deciding to which one of all the chil- 
dren having the same name a record belongs. The date of birth has 
value only as a point of reference, but it is such an important point 
that accurate estimates of a child’s future prospects cannot be made 
without it. Because it is so important it should be recorded with absolute
accuracy. To furnish some evidence of its correctness the space below it
is intended to indicate from whom the information about the child’s 
date of birth was obtained. Birth certificates are of course the best 
source of information. Ages should always be calculated from the birth 
date recorded upon the card. 
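The heading of the card described above amounts to a small record structure. The Python sketch below is one possible modern rendering and not the Bloomington card itself. The class and field names are hypothetical, the sample child is invented for illustration, and only the items named in the text are included.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class RecordCard:
    """Heading of a permanent test record card (illustrative field names)."""
    last_name: str
    first_name: str
    sex: str                      # "F" or "M"; shown on the card by its color
    date_of_birth: date
    birth_date_source: str        # from whom the birth date was obtained
    father_first_name: str = ""
    mother_first_name: str = ""
    race: str = ""
    nationality: str = ""
    case_number: Optional[str] = None   # set only if the child was studied specially
    addresses: List[str] = field(default_factory=list)

    @property
    def filing_name(self) -> str:
        """Name as written in the upper left-hand corner, last name first."""
        return f"{self.last_name}, {self.first_name}"

    @property
    def card_color(self) -> str:
        """White cards for girls, salmon-colored cards for boys."""
        return "white" if self.sex == "F" else "salmon"

    def record_move(self, new_address: str) -> None:
        """Add a change of address without erasing the earlier ones."""
        self.addresses.append(new_address)


# Invented example child, for illustration only.
card = RecordCard("Doe", "Mary", "F", date(1914, 5, 3), "birth certificate",
                  addresses=["First address"])
card.record_move("Second address")
print(card.filing_name, card.card_color, card.addresses[-1])
```

Keeping every address in a list mirrors the two lines left on the card for the initial address and its later changes.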

In the spaces to the right of the date of birth the child’s correct 
age in years and months on the first day of February and the first day 
of September is recorded for every year from the time he is given his 
first test until the spaces are all filled. It is easy to fill in these ages 
because after the first two are calculated the addition of one year each 
time straight across gives them correct for the first day of each semes- 
ter for the years covered. Filling these in eliminates the inefficient 
methods employed in their absence of (1) disregarding age, or of (2) 
going thru the tedium of age calculation for every child every time his 
card is considered with reference to his next promotion. This tedium 
generally takes a form something like this: “When John was given 
the last test he was 11 years and 3 months old. That test was given
in March. By the opening of school next September he would then be, 
let us see, March, April, May, June, July, August, and September, is 
6 months, he will be 11 years and 9 months old at the opening of school.” 
Something of this nature must be gone thru every time John comes up 
for special consideration. In the plan outlined here this process needs 
to be gone thru only twice, once for February first and once for Sep- 
tember first. After that a few strokes of the pen will give his age for 
every future promotion time. 
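The short cut described above, in which the two ages are calculated once and a year is added for each later column, can be written out as follows. This Python sketch is an illustration only. The function names are hypothetical, and the birth date is invented so that the child is 11 years and 3 months old in March, as in the example of John.

```python
from datetime import date


def age_on(birth, on):
    """Age in completed (years, months) on a given day."""
    months = (on.year - birth.year) * 12 + (on.month - birth.month)
    if on.day < birth.day:
        months -= 1  # the current month is not yet complete
    return divmod(months, 12)


def semester_ages(birth, first_year, n_years):
    """Ages on February 1 and September 1 for n_years consecutive years.

    Only the first pair is calculated from the birth date; each later
    pair is obtained by adding one year, as the card procedure suggests.
    """
    feb = age_on(birth, date(first_year, 2, 1))
    sep = age_on(birth, date(first_year, 9, 1))
    return [((feb[0] + k, feb[1]), (sep[0] + k, sep[1])) for k in range(n_years)]


birth = date(1912, 12, 1)  # invented birth date for illustration
print(age_on(birth, date(1924, 3, 15)))
print(semester_ages(birth, 1924, 3))
```

Adding whole years is safe here because the reference days, February first and September first, are the same in every year, so the count of months never changes.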

The remainder of the card consists of blank spaces for the record- 
ing of mental and educational tests. On this card there are fifteen lines 
sectioned off into sixteen columns. On each line may be written the 
record of one test with the information recorded in the appropriate 
columns as indicated by the headings of the columns. So far as pos- 
sible the arrangement of the columns was originally determined by the 
relative value of each item of information. The first column, headed 
Test, is for the abbreviation of the name of the test which was reported. 
The second column, headed Chronological Age, is for the child’s actual 
age in years and months on the day he took the test. The third column, 
headed Percentile, is for the child’s percentile rank as calculated from 
a standard percentile table. The fourth column, headed Performance 
Age, is for the so-called “mental age” on intelligence tests and for the 
so-called “achievement age”, “educational age”, etc., on achievement and
educational tests. When scores on educational and achievement tests
are evaluated in terms of school grades, these grade equivalents may
be recorded in this same column if the abbreviation “Gr.” is used and 
the grade is written in Roman numerals. Thus, “Gr. VI” would indi- 
cate that the child’s performance was that of a sixth grade child. The 
term “performance age” is employed as a general term for the age 
value of a child’s performance, i.e., of what he actually did on that 
particular test. The term “test age” would be just as appropriate and 
possibly more generally understood. The column headed Grade is for 
the child’s grade location in school. The next column is for the date 
on which this child took the test. His total score on the entire test is 
recorded in the following column, headed Score. Ratio to Norm in the 
next column is a general term intended to signify such ratios and co- 
efficients as “intelligence quotient”, “achievement quotient”, “attainment 
quotient”, etc., thru the entire list of values calculated somewhat simi- 
larly but labeled by various terms. School merely refers to the name 
of the school the child attended on the day he took the test. If the 
entire test is composed of sections of different types of tests, each desig- 
nated by a special number or title, the score on each of the successive 
sections may be recorded in order in the columns headed Score on each 
test. The last column at the right-hand edge of the card is headed 
Special Remarks and is used for the recording of additional significant 
data on any or all of the tests. 

For concrete illustrations of the usefulness of these cards the records 
of two children were selected from our files and are reproduced here 
correct in every detail except that fictitious names are substituted for 
the real ones, and the address and the names of the parents are omitted. 

The card of the boy, John Slow, shows him to be a white boy of 
American parentage born February 12, 1911. This record begins with 
his first test, the Indiana University Mental Survey Test, Schedule E, 
when he was 11 years and 9 months old and rated in the 5th percentile 
on this test. His rating in the 5th percentile on this test is determined 
from a standard percentile chart.* It means that 95 per cent of eleven- 
year-old children make better scores on this test than he did. In this 
way it shows exactly where he rates on this test. The space for per- 
formance age is left blank, because we feel very skeptical about the 
value of the so-called “mental ages” on group intelligence tests. John 
was in grade 3A at the time of this examination on November 21, 1922. 
He made a score of 27 on the test. The column Ratio to Norm is left 
blank because of our little faith in the value of “mental ages” upon 
which the intelligence quotients are based. John was attending Central 
School on the date of this test. The entire test is composed of four 
separate parts, on the first of which he made a score of 7, on the second 
of 7, and on the third and fourth of 8 and 5 respectively. 

* Young, Herman H. “How to Interpret and Make Use of Mental Tests.” Tenth
Conference on Educational Measurements, Bulletin of the Extension Division, Vol. VIII,
No. 2 (July, 1923), pp. 26-48.

At the time John’s record on this first test was filled in on his card
his age in years and months for the following February 1 was calculated
and recorded in the space provided for it in the heading of the
card. The same was done for the following September 1. After this
was recorded a mere glance showed that at the beginning of the next 
semester this 3A boy would be 11 years and 11 months old. It also 
showed that at the opening of school the next September he would be 
12 years and 6 months old. His age for all future promotion times that 
he will likely see was then calculated by the simple process of adding
one year to the one calculated last. 

The second test which John was given was Indiana University 
Test, Schedule F. He got this test because the unsatisfactory policy of 
giving tests by grades had not yet been displaced by the correct policy 
of giving tests by ages. Because Schedule F is intended for six-, seven-, 
and eight-year-old children, there are no standards for children over 
8 years of age and there is no way of reliably interpreting the results 
of this test on John. 

The Stanford Revision of the Binet-Simon Test was the third test 
given John. He was then 11 years and 10 months old. He rated with 
a mental age of 6 years and 6 months. At this time he was still in the 
3A grade because the test of January 9, 1923, came before the end of 
the semester. He rated with an intelligence quotient of 55. 

The Monroe Silent Reading Test was the next test which John took. 
He was then 12 years and 1 month old and in the 4B grade of Central 
School. In rate he made a score of 51. This is far below the average 
for third grade children. The best way of indicating this was by mark- 
ing his performance age as Grade III—. In comprehension his score, 
being 0, gave him a performance age of 0. 

The results of the last three tests are read in the same way. The 
percentile ratings are the most significant items on the card. These 
have been calculated on four different tests, the first of which came a 
year and four months before the last. The highest percentile he earned 
was that of 5. This would seem to be very significant evidence that he 
is mentally in the lowest 5 per cent of children his age, i.e., that at least 
95 per cent of children of his age are mentally superior to him. This 
enables us to locate him definitely, not only so far as his present con- 
dition is concerned, but also as to his future prospects. The evidence 
of the four tests combined is many more than four times as reliable and 
significant as that of any one of the tests alone could be. 

For contrast with John Slow the record of a girl was selected and 
is here reproduced under the fictitious name of Mary Bright. The data 
recorded for the various tests are read like those on John’s record. 
Four of the same tests are found on both records. The others are dif- 
ferent because of the difference in the ages and the grades of the two 
children. A mere glance at the columns headed percentile on the two 
records shows the remarkable difference that actually exists between 
the two children. Whereas John never rose above the 5th percentile, 
Mary never fell below the 92d percentile. A second glance shows that 
John’s tendency is to do things so poorly that all children of his age 
excel him, while Mary’s tendency is to do things so well that but few 
children of her age surpass her. With this amount of evidence as to
the persistent tendencies of these two children it is not only with ease, 
but also with considerable assurance, that their future educational and 
social possibilities and prospects can be predicted.

Such evidence accumulated from psychological and educational tests
on children from the time they enter school makes intelligent educational 
and vocational guidance possible. 

The records of these two children are interesting not only because 
John rates so low and Mary rates so high, but also because each is con- 
sistent in rating nearly the same on all tests. Such close agreement on 
successive tests is unusual. About the greatest variation of percentile 
ratings generally found is illustrated by the records of (1) Lowell, 
whose successive percentile ratings are 63, 19, and 72; and (2) Glen,
whose percentile ratings in the order he earned them are 37, 82, and 77. 
Both these boys had the same tests in the same order and on the same 
dates. The value of these accumulated records on these two boys will 
become apparent if we make several suppositions. First supposition: 
let us suppose these boys had been tested only once instead of three 
times. Second supposition: let us suppose the only test given them 
was the first test recorded. Then if both were judged by their ratings 
on this first test Lowell would appear to be mentally much superior to 
Glen, because he earned a percentile of 63, which means that only 37 
per cent of children of his age do better than he, while Glen earned a 
percentile of only 37, which means that 63 per cent of children his age 
do better than he. Third supposition: let us suppose that the only 
test given them was the second test recorded. Then if both were judged 
by their ratings on this second test Lowell would appear to be mentally 
much inferior to Glen, because he earned a percentile of only 19, which 
means that 81 per cent of children his age do better than he, while Glen 
earned a percentile of 82, which means that only 18 per cent of children 
his age do better than he. Fourth supposition: let us suppose that the 
only test given them was the third test recorded. Then if both were 
judged by their ratings on this third test alone they would appear to be 
mentally about equal, because the difference between their percentile 
ratings of 72 and 77 is relatively small. Such are the dangers and 
unfairness of comparing these two or any other two children with each 
other on the basis of only one test. The variations are just as great 
when these boys are compared to the average, e.g., under the second 
supposition Lowell rates above average and Glen below average, but 
under the third supposition these ratings are reversed. Lowell rates 
far below average and Glen rates far above average. Under the fourth 
supposition both rate nicely above average. 
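The suppositions above can be restated as a short Python sketch. It is an illustration only: the function names, the ten-point margin for calling two ratings “about equal”, and the use of the median as a summary of a child’s tendency are assumptions introduced here and are not the author’s procedure. Lowell’s first rating is taken as 63, the figure used in the discussion of the first test.

```python
from statistics import median

# Percentile ratings in the order the three tests were taken.
ratings = {"Lowell": [63, 19, 72], "Glen": [37, 82, 77]}


def percent_doing_better(percentile: int) -> int:
    """Per cent of children of the same age who score higher."""
    return 100 - percentile


def verdict_from_single_test(index: int, margin: int = 10) -> str:
    """Compare the two boys on one test only (margin is hypothetical)."""
    lowell = ratings["Lowell"][index]
    glen = ratings["Glen"][index]
    if abs(lowell - glen) <= margin:
        return "about equal"
    return "Lowell superior" if lowell > glen else "Glen superior"


for i in range(3):
    print(f"test {i + 1}: {verdict_from_single_test(i)}")

# One possible summary of the tendency shown by all three tests.
for name, values in ratings.items():
    print(name, median(values))
```

Judged on a single test the verdict changes from one test to the next, while a summary of all three ratings places Lowell somewhat above average and Glen higher.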

Variations and what appear to be inconsistencies, but usually not 
such large ones, are to be expected from successive group tests. Yet 
again they may be consistent, as in the cases of John and Mary. The 
point is this: when only one group test is given there is no way of 
knowing for which children the results are fair and reliable indices of 
mental ability and for which children they are unfair and unreliable. 
After a second test which varies from the first there is no definite basis 
for deciding which one of the two is the fairer. For this reason it seems 
that at least three tests with considerable time intervening between
them are essential to warrant reasonable assurance in the reliability of 
the evidence of a child’s mental ability. In this connection it must be 
remembered that tests do not necessarily measure anything at all. They 
merely give circumstantial evidence which needs to be carefully evalu- 
ated and interpreted before it has any value. The value of test results 
therefore varies directly with the correctness and comprehensiveness of 
their interpretation. 

Certainly much more assurance can be placed upon the interpreta- 
tion of the results of all three tests given these boys, than upon that 
based upon any one of them. If these records are evaluated on the 
basis of tendencies in the same way as we spoke of the tendencies
exhibited by the records of John and Mary, we will think of Lowell as
rating at or a very little above average, and of Glen rating somewhat 
higher, that is, clearly above average. In general Lowell does not 
rate as high as Glen. The probabilities are that the higher ratings 
made by each of the boys on two tests are more reliable indices of their 
true possibilities than the one low rating made by each of the boys. 
This conclusion seems warranted by the fact that no child can make a 
higher score on a test than he is able to make. Copying or special 
practice on the material of the test are about the only things which 
enable a child to make an unwarranted high score on a test given under 
standard conditions. On the other hand, there are hundreds of things 
which may and frequently do prevent children from doing themselves 
justice on a test. For this reason one low rating on a test in the ab- 
sence of other supporting evidence is not worthy of as much confidence 
as a high rating. Because different tests make demands upon different 
phases of mental life, a child may be expected to make a low rating or 
a high rating according as each particular test strikes a weak or a 
strong point. Thus a child’s apparently inconsistent ratings on different 
tests may really be significant reflections of an unequally balanced men- 
tal life, or of unequally balanced mental tests. 

The preceding illustrations and partial analysis of the data pre- 
sented show (1) the gross errors that may be made by judging an indi- 
vidual upon the basis of only one group test, and (2) the ease with 
which records on all tests for each child may be kept permanently and 
used with considerable assurance to avoid the errors bound to occur 
when the results of only one test are available. 








Measuring the Budgetary Procedure of a School 
System 


HAROLD F. CLARK, Assistant Professor of Education, Indiana University 


Our problem may be somewhat beyond the orthodox limits of edu- 
cational measurements. Certainly the attempt to apply measurement 
to the field of the financial control of education meets with unusual 
difficulty. 

This group as a group is probably thoroly convinced of the pos- 
sibility of measurements in educational achievement and mental traits. 
It may tax your credulity to think of applying a similar technique to 
the problems of government. You see, the problem for education and 
government is one and the same in this attempt to apply measurements 
to the field of financial organization and control. 

Some thinkers in the general field of government have not despaired 
of the possibility of measurement. Mr. Buck, of the New York Bureau 
of Municipal Research, writing in the National Municipal Review for 
March, 1924, says: “Although many may not be willing to agree that 
the results of government are measurable, we are constantly evaluating 
governmental services or trying to evaluate them in some way or other. 
Measurements of work and measurements of results among other things 
may be mentioned. It may sound like a dream, but we must devise 
methods by which we can measure the results of government so the 
information will be accurate and conclusive.” 

Mr. Buck, to illustrate the possibility to his popular and probably 
somewhat hostile audience, gives some of the old illustrations with which 
you all are familiar. He reminds us that it was not so many hundred 
years ago that we had no accurate way of measuring temperature. If 
the doctor wanted to know if the patient had a fever, he felt him and 
said: “I think he has. He seems to have.” We are in the same con- 
dition in the financial administration of education today. We believe 
our way of preparing the budget or not preparing it is better than some 
other. It seems to us and we think just as the witch-doctor of old 
thought about the impossibility of measuring temperature. Some of us 
will say that it must always be so in the field of financial control of 
education. 

For our critical friends who say it cannot be done, we will quote a
statement of one of the greatest authorities in the field of measure- 
ment: “Indeed a case could well be made out for the thesis that the 
theoretical objections sometimes brought against social and political 
measurements really hold in the last resort against all measurement 
and ‘prove too much’; and that the real difference between such
measurement and physical measurement is simply that social and political
phenomena, being practically more difficult to handle, force on our notice
the theory of knowledge difficulties inherent in all measurement, whereas
in physical measurement familiarity has bred contempt.” 

If some of you are skeptical of the possibility of scientific measure- 
ment in this field, what will the critics of measurements and the skeptics 
of quantitative technique in general think? Some of our humorously-inclined
friends might even say “We are willing to admit that ‘whatever
exists, exists in some quantity, and whatever exists in quantity
can be measured’ but we deny the existence of school budgets.” I am
afraid in all too many cases we would have to admit they are right. 

Other settled-minded people might take the position Bergson states 
in speaking of more general things: “Reason does not proceed in such 
matters as in geometry, where impersonal premises are given once and 
for all, and an impersonal conclusion must perforce be drawn.” That 
really is the essential question—Can such matters as budgetary pro- 
cedure be put on an impersonal basis and therefore made scientific, or 
must they always remain matters of personal opinion? 

Some of you will think of the superintendent that followed a pro- 
cedure that would score about 0 on our score card, and yet was remark- 
ably successful in his financial affairs. To the minds of some that is 
the effective answer showing that measurement breaks down when ap- 
plied to the personal element because it may violate some of the rules 
and yet succeed. The case is just the other way; the attempt of science 
is to eliminate the personal factor and make the judgment perfectly 
objective. There may be a man who can guess the temperature more 
accurately than the thermometer will register it, but we cannot build
procedure around such chances. Measurement attempts to eliminate per- 
sonal factors as to what are the facts, then to allow all the freedom in 
the world as to what to do with them, until science advances further 
and makes the procedure part of the facts. 

The following work grew out of the attempt to find some kind of 
procedure by which the man out in the field could check over his budget
and improve it. That is the test we want to apply; if it succeeds in 
that it will fulfil its purpose regardless of the statistical beauty of the 
work. The attempt then was to find a process that would enable us to 
check budgetary procedure more or less objectively, with as little per- 
sonal opinion as possible. A survey of the possible methods to use soon 
convinced us that the score card was the best initial point of attack.

Taking up the score card itself there are four questions to be 
answered: 

1. What are the elements composing the score card?

2. How were the elements obtained? 

3. How was the value of each determined?

4. How is the score card used? 


The first question, What are the elements composing the score card? 
is most effectively answered by turning to the score card itself. 

Tentative Score Card
Budget of a School System
(Both procedure and results)

Score on the basis of 100 points.

I. Preparation.

(6) 1. Who prepares estimates?
(7) 2. How are the estimates of expenditure prepared?
(6) 3. How are the estimates of revenue prepared?
(5) 4. Universality: are all the items in the budget?
(4) 5. Specialization: in how small divisions is the budget prepared?
(4) 6. When: how near to date when budget begins?
(5) 7. Time consumed in preparing. Are standard forms used? Is a budget book kept?
(5) 8. Sincerity: are estimates padded or amounts given as needed?

II. Voting.

(4) 9. Provision for public discussion, hearings, before voted.
(4) 10. How much do voting authorities change budget? Per cent?

III. Execution.

(6) 11. Who executes, carries out provisions of the budget?
(6) 12. Provision made for checking expenditures with budget estimates thru the year, preferably monthly.
(3) 13. What are the provisions for handling deficiencies? (Transfers, additional appropriations, or appropriations for contingencies.)
(4) 14. Amount of borrowing necessary. (Per cent of budget.)
(5) 15. Provisions for seeing that departments do not overspend.

IV. Control.

(5) 16. How completely is the fiscal period closed? Method (How are bills handled that are presented after the close of the fiscal year?).
(4) 17. Per cent of surplus or deficit.
(5) 18. What provision for audit?

V. Special.

(8) 19. What is the cost of the budget, what savings does it make, what waste prevent?
(4) 20. Ease with which publicity material can be prepared from budget forms.

Let us turn to the second question. How were the elements obtained? 
Most of the customary schemes were brought into use to determine the 
elements to be used. 


a. Scores of books and articles were analyzed to see what they 
suggested as being vital to good budgetary procedure. 

b. Opinions were obtained from leaders in the field and students
of the problem.

c. Job-analysis was carried thru to find the elements that make a
good budget.

d. Finally the items were checked against budgets that were actually
working, with the constant thought in mind: What are the
factors that caused them to succeed?


Special attention was paid to the factors that can be transferred to 
other situations; in other words, the impersonal factors. We are not 
decrying the personal element or belittling the study of these personal
elements. We are simply not interested in the personal elements that
can violate all the principles and succeed. We admit they are in this 
field as in all others. We are interested at this time in only those 
elements that can be applied to typical situations. 

The great need of the immediate future is to take these impersonal 
elements and try them out by an actual laboratory and experimental 
method. Picking the right elements with more certainty depends upon 
testing out experimentally the different plans with only one of the ele- 
ments varying. To the extent that these methods have been tried out 
with all conditions except one constant, we have had a laboratory and 
experimental method. It is needless to say that with so many factors 
at work it is difficult to tell which is the effective factor. Refinement 
of technique to determine this more accurately is what is needed. There 
are no insuperable practical or theoretical difficulties in determining
such matters on the experimental basis. With every school man in the 
country working on one or two problems each year we would have a 
laboratory sufficient to solve this and many of our other problems. 

We shall pass to the third question, How was the value of each 
item determined? At present no other way seems to be feasible to 
determine the value of the elements of the score card but by judgment 
and opinion. The opinion of competent judges is the basis of the values
assigned in this score card. Ultimately, by a long process of trying 
out with one factor different and others similar, differences in value of 
the different elements could be tested experimentally by the relative 
change in efficiency produced by each element. 

Let us consider the fourth question, How is the score card used? 
As suggested before, the reason for constructing any such devices as 
we are considering is to be of use in our school systems. It may 
be at first that such a score card will perform its best service as a 
checking device. It may be used to call the attention of the superintendent
to the best practice. It is the combined and condensed judgment
of the best authorities. For use as a check list to call the
superintendent’s attention to the best practice and to get him to justify his
own if it is different, the card should be of immediate value. 

Our main consideration this morning is not with the use but with 
more technical matters. Just a word concerning the use before we pass 
on. It will doubtless be said that the difficulty of obtaining some of 
the information will make the score card so difficult to use that it will 
not justify the effort. The very effectual answer is that there is some- 
thing wrong with the record system in that case and it would be wise 
to check over it. All the facts that are needed to use the score card a 
superintendent should know or have readily available. The score card 
should prove of some value for survey purposes in comparing one system 
with another. 
However unreliable this instrument may be, certainly it is more
reliable than to compare one system with another on the basis of judg- 
ment without any checking devices. The question to ask is, Is the pro- 
cedure better than what it will replace? 

Complete instructions for scoring will be found in the little manual
to go with the score card which will soon be issued. In it will be found 
the method of scoring each question, the conditions under which full 
value and partial value are to be given. Some indication of the reliabil- 
ity and validity of the instrument of measurement will be found therein. 

Having touched upon the possibility of such a score card and upon 
the card itself, we will take up some of the more technical points of 
criticism. 

The first point suggested by a considerable number of friendly 
critics is that we are attempting to measure two things that are not 
measurable at the same time. They tell us that both procedure and 
results cannot be measured in the same scale, that the material to be 
measured is not homogeneous, therefore there can be no unit by which 
to measure it. 

It is immediately admitted that the ideal course would be to measure 
the procedure and the result separately. The author hopes ultimately 
to do this unless someone else does it more efficiently than he is able to 
do. For the present time the practical thing to do seems to be to cover 
the entire field. That is the reason such an item as No. 19, which con- 
siders results, is included. Some will say that results are the only things 
in which we are interested, that they are the only things of value; 
measure them and let all else go.

In building up judgment scales, score cards, and similar devices, we 
have been told to use competent judges. Does anyone know what difference
it would make in a specific situation if the judges were not competent?
In other words, what would be the difference between expert
and non-expert judgment? Taking the well known and valuable school 
building score card, would it have made any difference if the values had 
been assigned by non-expert judges? 

Let us take the hypothetical case where the values assigned by two
groups of judges, one supposed to be expert and the other non-expert, 
are the same. What are the possible explanations? It is possible that 
no one knows the value to be given to the different elements, in which 
case they would all be guesses and the so-called expert and non-expert 
would come out the same. On the other hand, it is possible that the 
values are so obvious that everyone knows the relative value and for 
that reason expert and non-expert would assign the same value. There 
is a third possibility, that judgment does not determine the value, but 
the value has been controlled by statistical arrangement making expert 
and non-expert the same. 

Carrying the point a little further, for an illustration let us assume 
that we have one hundred expert and one hundred non-expert judges. 

1. Will the final value given to the different points be the same in
each case?

2. If the values are the same, has the process of arriving at the
values been different?

3. Assuming further that the final values of the non-expert judgments
are approximately the same, would it make the scale less
valuable if it were based on the non-expert judgments?


These may seem nonsensical questions, but I hope you see the im- 
portant point under them. Is the value determined by judgment or by 
mechanical arrangement and mathematical laws? 

Taking up the last of our three questions first, “Would the scale 
be less valuable if based upon non-expert judgment even tho the values 
were the same as would have been assigned by expert judges?” We are
safe in saying it would, at least to many people. 

The second question, “If the values are the same, has the process 
of arriving at these values been different?” We are safe in assuming 
again that the process has been different. Let us assume that our score 
card is made of twenty points that are exactly of equal weight. The 
expert judges will then assign five points to each, and a sufficient num- 
ber of non-expert judges would give an average of five points each from 
the laws of chance. The important question for us is, Is some such
principle determining the value assigned by even the expert judge?

To return to our first question, “Will the values given by the expert 
and non-expert judges be the same?” In the illustration given above, 
they will. The only way to know what would happen in an actual situ- 
ation was to try it out with both expert and non-expert judges. That 
is what was done. We have a large number of judgments from recog- 
nized experts in budgetary procedure from the fields of business, govern- 
ment, and education. The final value as obtained from the expert judges 
was compared with the judgment of one hundred freshman and sopho- 
more college students. If the judgments of the college students were a 
pure chance affair and the number was sufficiently large, each of the 
twenty statements should have been given five points. The actual situ- 
ation was slightly different from this, the average deviation being about 
one-half of one point from five. The second fifty judgments did nothing 
to decrease the size of this average deviation from that obtained from 
the first fifty. You see from this how improbable it is that this devi- 
ation from the chance value is the fault of having only one hundred 
judges. It is probable that even these non-expert judges are convinced 
that one or two of the points are more valuable than the others and one 
or two less valuable. The same elements tend to hold the high values 
and the same ones the low values, right thru all the answers. It would 
tend to show there is some matter of judgment producing it. 

The further question might be asked, Would each element be given 
the value of five if we had judges who know even less of the subject 
than freshman and sophomore students? An answer to this question 
was obtained by having fourth and fifth grade children judge the value 
of each item. 

The average deviation of expert judgments from the chance value 
is more than twice that of the next best informed group. This does not 
prove that the experts are right, but at least the expert judges as a 
group were convinced that some points were much more valuable than 
others and so marked them. The answer to our question is that expert 
judgments do differ considerably from non-expert. 
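The “chance value” and the average deviation from it can be made concrete with a short Python sketch. It is offered only as an illustration: the function names are hypothetical, and the weights are the whole-number values printed on the tentative score card, so the figure obtained shows how the calculation runs and should not be read as the exact average deviation found for the expert judges.

```python
def chance_value(n_items: int, total: float = 100) -> float:
    """Points each item would receive if weights were a pure chance affair."""
    return total / n_items


def average_deviation(weights, total: float = 100) -> float:
    """Mean absolute difference between each weight and the chance value."""
    expected = chance_value(len(weights), total)
    return sum(abs(w - expected) for w in weights) / len(weights)


# Point values of items 1 to 20 as printed on the score card.
card_weights = [6, 7, 6, 5, 4, 4, 5, 5, 4, 4, 6, 6, 3, 4, 5, 5, 4, 5, 8, 4]

print(chance_value(20))                 # 5.0
print(average_deviation(card_weights))  # 0.9
print(average_deviation([5] * 20))      # 0.0
```

A set of judgments that assigns five points to every item has an average deviation of zero; the more decidedly a group of judges separates the valuable items from the less valuable ones, the larger the figure becomes.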
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Introducing questions that have no value, or even negative value, 
is a good way to get light on some of the factors that are determining 
the value of questions. This affords a very good check on the ability 
of the judges. 

Reference has been made to the part which mechanical arrange- 
ment or mathematical law may play in determining the value of the 
different elements. It is a point that has been largely overlooked in the 
past. To determine its influence in one way, the list of items was scored 
by a large number of judges in the form in which it has been presented 
here today, that is, in the five large divisions. Another set of judges 
were given the twenty statements not divided into the five large divi- 
sions. You noticed one of the divisions had two items under it, another 
had eight. There was a tendency on the part of some of the judges to 
make the divisions of approximately equal weight. Naturally this in- 
creased the weight given to the items in the divisions having only a 
few items. 
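
A minimal sketch of this effect, assuming 100 points and a judge who gives every division the same weight and then shares it equally among the items of that division. Only the division sizes of two and eight come from the score card as described; the other three sizes are hypothetical, chosen so that the items total twenty.

```python
def item_weights_with_equal_divisions(items_per_division, total_points=100.0):
    """Per-item weight in each division when every division gets equal weight."""
    division_weight = total_points / len(items_per_division)
    return [division_weight / n for n in items_per_division]


# Sizes 2 and 8 are from the text; 3, 3, 4 are hypothetical fillers.
division_sizes = [2, 8, 3, 3, 4]

weights = item_weights_with_equal_divisions(division_sizes)
equal_item_weight = 100.0 / sum(division_sizes)
print(weights, equal_item_weight)
```

Under these assumptions an item in the two-item division carries four times the weight of an item in the eight-item division, although each would carry five points if the items were weighted without regard to divisions.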

Another good illustration of the effect mechanical arrangement has 
upon the score values can be obtained by combining or dividing ques- 
tions. There is considerable evidence to show that breaking up one ques- 
tion into two or more tends to increase the weight to that original item. 
The evidence we have shows this to be true to the full extent with non- 
expert judges. It is only natural if we put in another item that it will 
obtain its proportional part of the total weight from the non-expert. 
This seems to decrease with the expertness of the judges, and with suf- 
ficiently expert judges it would probably disappear entirely. This in- 
crease happens in a slight degree even with our most expert judges. 
The total evidence seems to show that mechanical arrangement and 
mathematical law control the value of items to a very considerable 
extent. 
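
The limiting case for non-expert judges can be sketched as follows, assuming 100 points spread evenly over all items, so that every item, including each piece of a divided question, receives its proportional share. The function name and the even-spread rule are illustrative assumptions, not the procedure used with the judges.

```python
def weight_of_original_item(n_items: int, pieces: int, total_points: float = 100.0) -> float:
    """Total weight carried by one original item after it is split into `pieces`,
    when every resulting item receives an equal share of the total."""
    new_item_count = n_items - 1 + pieces
    share_per_item = total_points / new_item_count
    return share_per_item * pieces


before = weight_of_original_item(20, 1)  # item left whole
after = weight_of_original_item(20, 2)   # item broken into two questions
print(before, after)
```

With twenty items, breaking one question into two raises the weight of that original element from 5 points to a little over 9.5, which is the full proportional increase described for the non-expert.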

The moral would seem to be to pick or arrange the elements going 
into your score card with great care, and you can largely ignore your 
judges, altho the more expert the judges are, the better instrument you 
will have. This points out to us one of the advantages as well as one of 
the defects of a score card made up of more or less arbitrarily selected 
elements. We need more careful experimentation to determine the exact 
influence of these different factors in a final score. 

I would like to suggest one other matter to those interested in the 
more technical phases of the subject. There seems to be no difference 
in final value between ranking the elements in order and then deriving 
judgment units, and assigning the percentage values directly. This may not 
be a fair test, but it raises the question how much difference in final 
ranking there would have been in handwriting or composition scales by 
the direct percentage method of scoring. 

This score card has not been offered this morning as a final solution 
of the problem of measuring the budgetary administration and control 
of a school system. On the other hand, it is distinctly tentative, and 
our most sanguine hopes are that it may be a checking device for those 
who want to go over the procedure and see where they can improve it; 
to have possibly a limited value in comparing one system with another, 
and finally to stimulate the bringing forth of better instruments. 

To sum up what we have said: 


1. There are unusual difficulties in attempting to apply measure- 
ment to the field we have been discussing, but it can and must be done. 

2. As a beginning attempt, we have used a score card based upon 
expert judgment. Ultimately both the elements and their values must 
be determined by experimental and laboratory methods. 

3. A beginning has been made as to some of the characteristics of 
expert judgments and their differences from non-expert. The mechanical 
arrangement and make-up of a score card may play a very important 
part in determining the value of the different items even for expert 
judges. We have shown what some of these factors are and their influ- 
ence, but we need much more careful determination of their value. 








Improving Comprehension Ability in Silent 
Reading 


GROVER H. ALDERMAN, Professor of Elementary Education, 
Indiana University 


I. INTRODUCTION 


Last year at the Tenth Annual Conference on Educational Measure- 
ments, a preliminary report was made of a study entitled, “The Value 
of Certain Types of Drill Exercises on Comprehension” (1). Since the 
study which is presented today is a continuation, or an outgrowth, of 
the one presented last year, it seems advisable to spend a few minutes 
reviewing the previous investigation before any discussion of the present 
problem is undertaken. 


II. A REVIEW OF THE PREVIOUS STUDY 


The aim of last year’s experiment, as stated at that time, was to 
measure the relative value of three types of exercises which were given 
for the purpose of improving comprehension ability in silent reading. 
The method used in selecting these three exercises was similar to that 
used by O’Brien (3) in his experiment in the schools of Illinois, in 
which he selected three types of exercises for the purpose of improving 
speed in silent reading. In order to determine the types of exercises 
which should be used for improving comprehension ability, a careful 
and critical survey was made of all the scientific literature on silent 
reading which gave any concrete data bearing on the subject of com- 
prehension. 

Anyone who has attempted to make such a survey is impressed with 
the fact that up to the present time most of our scientific investigations 
in the field of silent reading have had to do with such topics as eye 
movement, inner speech, non-readers, speed, etc., while very little atten- 
tion has been given to the possibility of helping the child improve his 
ability to comprehend what he reads. A lack of investigations along 
this line may be due to two causes: first, most measuring scales which 
have been devised for measuring reading ability have been so con- 
structed that they measure speed more accurately than any other ele- 
ment in silent reading. Second, such studies as made by Judd (2) in 
the Cleveland schools and O’Brien (3) in the schools of Illinois have 
led some investigators to believe that if the speed of reading is in- 
creased, comprehension will increase in like proportion. This assump- 
tion we believe to be only partially true, but be that as it may, anyone 
who attempts to base an experiment in comprehension on the findings 
of previous experiments is confronted with a dearth of data. 

As reported last year, a survey of the available scientific literature 
pointed to the fact that the following five elements were the most im- 
portant factors involved in comprehension ability in silent reading: 
intelligence, vocabulary, organization, rate, and reproduction. Stated 
negatively, the characteristics of that child who scores low in compre- 
hension on a standard silent reading scale are intelligence below nor- 
mal, limited vocabulary, lack of organization ability, slow rate of read- 
ing, and inability to retain and reproduce the important thoughts of 
the paragraph. After reviewing these five elements, one might con- 
clude that the chief factor in comprehension is intelligence. While data 
are available to substantiate this point of view, the fact still remains 
that we shall always have pupils with us, in all departments of our 
public schools, who are of only average ability; the problem of how to 
help this type of child improve his reading ability will always be with us. 

Of the five elements just enumerated, intelligence was disregarded 
for obvious reasons; and rate was omitted since the relation of rate to 
comprehension was answered by O’Brien in his experiment. The prob- 
lem which remained, therefore, was clear cut. What type of reading 
exercises should be given to the child to help him enlarge his reading 
vocabulary; teach him to select the important thoughts and organize 
them in a logical manner, in order that he may retain and reproduce 
them when the occasion demands? 

In order to obtain an evaluation of these three types of exercises 
when taught under normal classroom conditions, 1,500 children in the 
schools of East Chicago, Frankfort, Lebanon, Mishawaka, Huntington, 
and Indianapolis were taught by methods which we prescribed for thirty 
minutes each day for a period of six weeks. The data from this experi- 
mental group were then checked against the data obtained from 1,500 
children who were working under the same controlled conditions except 
that the teaching was done by traditional methods. 

The data which we obtained were outstanding. As measured by the 
Thorndike-McCall Scale, any one of the methods used was superior to 
the traditional method. From the relative point of view, however, the 
organization exercise was most effective, the retention exercise was a 
close second, and the vocabulary exercise was third, proving to be only 
approximately half as effective as either of the other methods. 

When this study was reviewed last summer in a graduate class in 
Elementary Supervision, the following problems were raised by mem- 
bers of the class: 

1. The average reading teacher does not use one method alone to 
the exclusion of all others, but uses a combination of all the best meth- 
ods. The question which naturally followed was, What would be the 
result if a combination of these three methods were used, each being 
given the amount of time which last year’s experiment seemed to justify? 

2. The length of time thru which the experiment ran was also dis- 
cussed. While it was recognized that marked improvements have been 
made during a short intensive period of training, this does not represent 
typical school conditions. In order to show the superiority of this type 
of teaching over the traditional method, it was felt that the experiment 
should continue over an entire semester. 

3. Is the size of the school system a factor which must be taken 
into consideration when evaluating an experiment? Does the amount of 
previous training which a child may have in a particular skill affect 
the results obtained from an experiment of this type? The previous 
experiment failed to answer this question. 

4. If a method of teaching is really a valuable one, would it not 
be true that the teachers themselves would recognize this fact and con- 
tinue the method even after the experiment is over? Is it not true that 
a method to be effective must be acceptable to the classroom teacher? 
Would it not be valuable to learn what the teachers themselves thought 
of the value of this method of teaching as a means of improving com- 
prehension? 


III. THE PROBLEM FOR THIS YEAR’S EXPERIMENT 


The questions which needed investigation and which were not an- 
swered in our previous investigation are briefly stated as follows: 

1. How much improvement may be made in comprehension ability 
by using certain reading exercises for a period of thirty minutes each 
day thruout an entire semester? 

2. If a method of teaching proves beneficial for the average child, 
will it also prove beneficial for the child whose reading ability is up to 
or above standard? 


IV. THE METHOD 


The method used in conducting this experiment is similar to the one 
reported last year. All expenses and clerical assistance were furnished 
by the Bureau of Coöperative Research. Eleven schools from Indiana 
and one school from Illinois coöperated. These schools are representa- 
tive of the smaller school systems of Indiana. They vary in size from 
475 to 3,450 population and have an average enrollment in the elementary 
schools of 300 pupils. The names of these towns in alphabetical order 
are as follows: Bainbridge, Bridgeport (Ill.), Hobart, Jonesboro, 
Knightstown, Liberty, Lowell, North Judson, Onward, Pendleton, Pitts- 
boro, and Thorntown. 

In order that the superintendents who were to supervise the experi- 
ment in their own schools might become familiar with the methods to 
be used, several conferences were held at the University so that each 
superintendent might be able to answer any detailed question which was 
not answered in the mimeographed bulletin which was sent to each 
teacher coöperating. The instructions contained the following informa- 
tion: Form I of the Thorndike-McCall Test was to be given in Grades 
4 to 8 in all schools coöperating, on the forenoon of September 17. It 
was suggested that these tests be given by the same person, preferably 
the superintendent. The pupils were not to be divided into a drill and 
non-drill group, as was done last year. The effectiveness of this type 
of teaching was to be checked against normal procedure as recorded on 
the Thorndike-McCall Scale. 

The mimeographed instructions which were sent to each teacher 
were carefully compiled in every detail. Each teacher was asked to 
select her reading material from those books which were available for 
her grade, or the grade above. Since these exercises were given for 
the purpose of increasing comprehension, it was recommended that only 
factual material be used. Selections from geographies, histories, physi- 
ologies, etc., were especially recommended. 

In the organization exercise which was to be given on Mondays and 
Tuesdays of each week, the child’s problem was to select the important 
thought from each paragraph and organize these in a logical manner, 
around the central topic or problem. After the first half of the period 
had been given over to this type of work the teacher, with the aid of 
the class, developed the outline on the board in order that each child 
might receive additional suggestions and corrections. 

The procedure for the retention exercise which was given on Wednes- 
days and Thursdays of each week was briefly as follows: Each pupil 
was asked to read the selection thru carefully for the purpose of pick- 
ing out the essential facts. As soon as this was done, the class was 
given a rigorous test consisting of ten or eleven definite questions which 
the teacher thought were important. Besides this, one test each week 
was given to test the child over the important facts which had been 
read during the week. Thus each child read his lesson, knowing that 
he would be tested and graded on his ability to retain and reproduce 
essential facts. 

The purpose of the vocabulary exercise which was given on Friday 
of each week was an attempt to increase the child’s comprehension abil- 
ity by enlarging his reading vocabulary. As a means of doing this, the 
teacher was again asked to select her reading material from a wide 
range of subject-matter. This was done to make sure that all new 
words would be introduced in their natural setting. Great care was 
taken to master the meaning of all new words which the child would 
have occasion to use in his actual work with books of a content nature. 


V. THE DATA 


TABLE I—GENERAL INFORMATION 


Schools coöperating ........................... 12 
Grades ........................................ 4-8 
Number of classes ............................. 75 
Number of teachers ............................ 72 
Average length of recitation (minutes) ........ 30 
Total number of pupils ........................ 1,933 
    Grade 4 ................................... 410 
    Grade 5 ................................... 412 
    Grade 6 ................................... 398 
    Grade 7 ................................... 402 
    Grade 8 ................................... 311 


Table I gives the general information which is necessary in order 
to evaluate the study. As previously stated, there were 1,933 pupils in 
Grades 4 to 8 working under the direction of seventy-two teachers in 
twelve towns which had an average population of approximately one 
thousand. The average size of the class was twenty-six. The average 
length of the recitation was thirty minutes, while the average time given 
over to the study period was twenty minutes. 


TABLE II.—RESULTS OF SEPTEMBER TEST 


         NUMBER OF                  SEPTEMBER      GRADE 
GRADE    CLASSES      STANDARD      ‘T’ SCORE      ACHIEVEMENT 
4        15           39.6          *36.2          L 3 
5        16           [illegible]   [illegible]    [illegible] 
6        18           [illegible]   [illegible]    L 5 
7        14           [illegible]   [illegible]    H 5 
8        12           69.6          [illegible]    H 5 


* All grades below standard September 17, 1923. 71 of 75 classes below standard. 


Table II shows the number of classes in each grade, the standard 
in ‘T’ scores for the lower section of each grade with the scores we 
obtained as a result of our September testing. Under the caption of 
“grade achievement” will be found the reading level of the pupils when 
the experiment was begun. This indicates that Grades 6, 7, and 8 are 
of only fifth-grade standing, Grade 6 being of low fifth level and Grades 
7 and 8 being of high fifth level. 


Table III is the most significant table shown thus far. Under the 
September score it will be noted that the average score for all grades 
is below standard. When the tests were given the second time, Grades 
4, 5, and 6 were up to or above standard while Grades 7 and 8 were 
still below standard. In column three, under the title of “‘T’ Scores 
Gained”, it will be noted that all grades have gained, but in a marked 
decreasing order. This may be more clearly understood by represent- 
ing the ‘T’ score gain in terms of semesters gained. Grade 4 gained 
most with three semesters’ gain while Grade 8 gained least with but 
one semester’s gain. 

Since the data presented for Grades 7 and 8 are quite contrary to 
the findings of O’Brien, it seems advisable to offer some words of ex- 
planation at this time. As might be expected, Grades 7 and 8 were in 
some of the school systems included as a part of the junior high school 
organization. In the five high schools which had this type of organiza- 
tion, reading was displaced by the subject of English. In such instances 
grammar was offered two days each week, spelling one day each week, 
and reading, of the literary type, was offered but two days each week. 
In these five schools very little gain was made, with the exception of 
one school in which the principal was especially interested in our ex- 
periment. In this particular school as much gain was made in this 
grade as in any others. Eliminating these five schools from our tabula- 
tions, the data show that Grades 7 and 8 each gained two semesters. 
This still leaves the general average for all classes in these two grades 
below standard. It might be interesting to note, however, that most of 
the classes in these two grades are continuing the work thruout the 
second semester with the hope of bringing all classes up to standard. 


TABLE IV.—NUMBER OF CLASSES BY GRADES WHICH SCORED UP 
TO STANDARD OR ABOVE 


         NUMBER OF 
GRADE    CLASSES      SEPTEMBER    FEBRUARY 
4        15           1            13 
5        16           1            11 
6        18           1            10 
7        14           0            2 
8        12           2            4 
Total    75           5            40 








Table IV is given for the purpose of showing the number of classes 
which were up to standard before and after the drill work was given. 
It will be noted that out of the seventy-five classes which took part in 
this experiment, but five were up to standard when the tests were given 
in September. After the drill work was given, forty of the seventy-five 
classes were up to standard. An analysis of this type is helpful since 
it locates the classes which should receive special assistance from the 
supervisor. The supervisor, however, should not only be interested in 
knowing that each class is up to standard but should also be interested 
in what happens with each particular child. The supervisor should 
have some record which would show her the particular child in each 
grade which needs remedial exercises. 


TABLE V.—SHOWING NUMBER OF PUPILS IN EACH GRADE WHICH 
WERE BELOW STANDARD 


         TOTAL        NUMBER       NUMBER 
         NUMBER OF    BELOW        BELOW 
GRADE    PUPILS       SEPTEMBER    FEBRUARY 
4        410          20           147 
5        412          278          169 
6        398          271          189 
7        402          325          219 
8        311          234          182 








Table V shows the number of children in each grade which were 
below standard when the tests were given in September and also the 
number which were below standard when the tests were repeated in 
February. 


TABLE VI.—NUMBER GAINED, LOST, REMAINED SAME 


GRADE    TOTAL    GAINED    SAME    LOST 
4        410      349       24      37 
5        412      301       34      77 
6        398      285       30      83 
7        402      275       18      109 
8        311      199       16      96 
Total    1,933    1,409     122     402 








Table VI shows even a closer analysis. Of the 1,933 pupils who 
took part in the experiment in reading, 1,409 gained, 122 were no better 
readers than at the beginning of the semester, and 402 actually lost in 
reading ability. 
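
The figures of Table VI can be checked with a few lines of arithmetic. The sketch below uses the counts as printed in the table; the percentages are not given in the paper and are computed here only as an aid to reading it.

```python
# Grade: (total, gained, same, lost), as printed in Table VI.
table_vi = {
    4: (410, 349, 24, 37),
    5: (412, 301, 34, 77),
    6: (398, 285, 30, 83),
    7: (402, 275, 18, 109),
    8: (311, 199, 16, 96),
}

# Each row should balance: gained + same + lost = total pupils in the grade.
for grade, (pupils, gained, same, lost) in table_vi.items():
    assert gained + same + lost == pupils, f"Grade {grade} does not balance"

# Column totals across the five grades.
total, gained, same, lost = (sum(col) for col in zip(*table_vi.values()))

# Share of all pupils in each outcome, in per cent.
percentages = {
    label: round(100 * count / total, 1)
    for label, count in (("gained", gained), ("same", same), ("lost", lost))
}
print(total, gained, same, lost)
print(percentages)
```

On these counts roughly seven pupils in ten gained and about two in ten lost.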


TABLE VII.—GAINS BY QUARTILES 


GRADE      UPPER    MIDDLE    LOWER 
4          3.87     6.29      7.52 
5          4.64     4.81      6.08 
6          4.10     4.37      7.57 
7          5.74     4.91      7.35 
8          2.75     3.09      7.29 
Average    4.22     4.69      7.16 











When it was discovered that 404 pupils were actually harmed by 
our method of teaching reading, it certainly threw out a challenge to 
us to make an attempt to locate these pupils and try, if possible, to 
find out why they were harmed in their reading ability. In order to 
do this the pupils of each class were arranged in descending order 
according to the scores which they earned in their September tests. 
The gains in ‘T’ scores which were made by each pupil and by quar- 
tiles were then tabulated. Table VII indicates that those which gained 
least were in the upper quartile in reading ability when the Thorndike- 
McCall Tests were given in September. 


TABLE VIII.—NUMBER OF PUPILS WHO LOST BY GRADES AND BY 
QUARTILES 


GRADES      Q1               Q2 AND 3         Q4              TOTAL          PER CENT 
4           19               15               3               37             9.2 
5           23               48               6               77             19.1 
6           38               39               8               85             21.0 
7           45               58               6               109            27.1 
8           43               43               10              96             23.8 
Total       168              203              33              404 
            41.7 per cent    50.3 per cent    8.2 per cent    21 per cent 
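
The quartile figures in Tables VII and VIII can be checked in the same way. The sketch below takes the values as printed and assumes, as the arithmetic suggests, that the last row of Table VII is the average over the five grades and that the percentages in Table VIII are shares of the 404 pupils who lost. The total of 1,933 pupils is taken from Table I.

```python
# Mean 'T' score gains by quartile (upper, middle, lower), as printed in Table VII.
gains_by_grade = {
    4: (3.87, 6.29, 7.52),
    5: (4.64, 4.81, 6.08),
    6: (4.10, 4.37, 7.57),
    7: (5.74, 4.91, 7.35),
    8: (2.75, 3.09, 7.29),
}

# Pupils who lost, by quartile (Q1, Q2 and 3, Q4), as printed in Table VIII.
losses_by_grade = {
    4: (19, 15, 3),
    5: (23, 48, 6),
    6: (38, 39, 8),
    7: (45, 58, 6),
    8: (43, 43, 10),
}

TOTAL_PUPILS = 1933


def column_means(rows):
    """Average of each column, rounded to two decimals."""
    return [round(sum(col) / len(col), 2) for col in zip(*rows)]


def column_totals(rows):
    """Sum of each column."""
    return [sum(col) for col in zip(*rows)]


mean_gains = column_means(gains_by_grade.values())
loss_totals = column_totals(losses_by_grade.values())
all_losses = sum(loss_totals)

# Share of all losses falling in each quartile group, in per cent.
loss_shares = [100 * t / all_losses for t in loss_totals]
# Share of all pupils who lost, in per cent.
share_of_all_pupils = 100 * all_losses / TOTAL_PUPILS

print(mean_gains)
print(loss_totals, all_losses)
print(loss_shares, share_of_all_pupils)
```

The computed shares agree with the printed percentages to within rounding.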











VI. CONCLUSIONS 


The conclusions which we shall make on this study are not general 
in nature. They apply only to teaching done under the conditions de- 
scribed in the present experiment. With this thought in mind, it seems 
safe to draw the following conclusions: 

1. Comprehension ability in silent reading as measured by the 
Thorndike-McCall Scale may be improved to a degree equivalent to two 
semesters in Grades 4 to 8, by careful, systematic drill work covering 
a period of one semester, provided that thirty minutes each day is 
devoted to this type of work. 

2. A teacher who is interested in improving comprehension ability 
in silent reading would be justified in using the type of drill work 
described in this experiment. 

3. Before any attempt is made to improve comprehension ability 
in silent reading, children should be divided according to their reading 
ability since the type of training needed by one group may actually do 
harm to another. 


VII. AN EVALUATION 


In evaluating a study of this type one must keep in mind the con- 
ditions under which the experiment was performed. It is a typical 
example of the arm-chair type of supervision. None of the teachers 
were visited by me, no demonstration lessons were taught, all the super- 
vision that I was able to do was done by personal letters or bulletins. 
On the other hand, the superintendents coöperated in a most loyal man- 
ner. Each superintendent considered the experiment as if it were his 
own particular problem. Teachers’ meetings were held, and the prog- 
ress of the experiment was discussed from time to time under the 
direction of the superintendent. 


The Advantage of Ability Grouping 


CLIFFORD WOODY, Director of Bureau of Educational Reference and 
Research, University of Michigan 


THE main investigation which will be reported is an attempt to 
measure the effectiveness of ability grouping. It was attempted in an 
effort to throw some light on some of the baffling problems which have 
arisen in connection with the use of educational and mental tests in 
the public schools of Michigan. Three bits of evidence which have 
arisen from the use of these tests and which will portray the nature 
of the existing problems will be presented before discussing the main 
investigation. 

Table I, presenting the first bit of evidence, shows the median 
scores attained in October, 1921, by grades and by ages on the National 
Intelligence Test, Scale A, Form 2, in some 23 cities of Michigan. In 
this table, Group I refers to that group of cities having less than 5,000 
inhabitants; Group II, to that group of cities having from 5,000 to 
10,000 inhabitants; Group III, to that group of cities having more than 
10,000 inhabitants. 


TABLE I.—SCORES ON THE NATIONAL INTELLIGENCE TEST BY 
GRADES IN THE DIFFERENT GROUPS OF MICHIGAN CITIES 
(SCALE A—FORM 2) 


GRADE          GROUP I    GROUP II    GROUP III      ALL CITIES 
[rows illegible] 


SCORES ON THE NATIONAL INTELLIGENCE TEST BY AGES IN THE 
DIFFERENT GROUPS OF MICHIGAN CITIES (SCALE A—FORM 2) 


AGE            GROUP I    GROUP II    GROUP III      ALL CITIES 
[illegible]    40.0       37.5        40.4           39.6 
[row illegible] 
[illegible]    64.1       62.9        [illegible]    60.2 
[illegible]    75.0       74          80.3           78.9 
[row illegible] 
[illegible]    107.5      115.9       126.1          125.0 
[row partly illegible: 9.0, 118.1, 122.5] 


Scrutiny of the upper portion of this table reveals that the scores 
attained were from approximately 2 to 8 points lower in the different 
grades in Group I than the scores attained in Group II; that the scores 
attained in Group II were approximately 1 to 4 points lower than in 
Group III. The scores in Group I were about 10 points lower than in 
Group III. Scrutiny of the lower part of the table reveals that the 
scores by ages were lower in Group I than in Group II and that those 
in Group II were in turn lower than in Group III. 

The one exception to this generalization is that the score for age 9 
was higher in Group I than in Group II and that in Group II was higher 
than in Group III. The cause of this irregularity is unknown. In gen- 
eral the scores on the National Intelligence Tests were lower in the 
smaller cities than in the larger cities. 

Table II, presenting the second bit of evidence, shows the levels of 
achievement in approximately the same group of cities on a spelling 
test based on the Ayres List, the Monroe Silent Reading Test, and the 
Courtis Supervisory Tests in Arithmetic. A survey of this table shows 
that in each grade with but one or two exceptions the level of achieve- 
ment in each of the subjects was higher in Group II than in Group I, 
and in Group III than in Group II. Thus the scores on the educational 
tests were lower in the smaller cities than in the larger cities. 


TABLE II.—STANDARDS OF ACHIEVEMENT IN SPELLING, MONROE 
READING, AND THE COURTIS SUPERVISORY ARITHMETIC 
IN VARIOUS GROUPS OF MICHIGAN CITIES 


The findings on the two types of tests were in very close agree- 
ment, but the explanation for the situations portrayed may be vastly 
different. On the one hand, the overzealous enthusiast for mental tests 
probably would explain the situation by saying that the larger cities 
have a selective power and that as a rule only the less intelligent and 
even the stupid are satisfied to remain in the smaller towns. The writer 
has heard this very insinuation made as an explanation of the above 
facts. On the other hand, the enthusiastic believer in the molding power 
of environment would disclaim that superior intelligence has anything 
to do with the situation and would lament the inequality of educational 
opportunity. He would point to the better trained teachers, better 
supervision, and better equipment in the larger cities, and argue that 
if these better educational advantages do not result in superior educa- 
tional results we had better fold our educational tents and quietly 
accept the doctrine of “Presbyterian predestination”. The writer has 
heard this comment given in explanation of the situation previously 
described. 

The third bit of evidence to be submitted before the introduction 
of the real problem of the effect of ability grouping represents an at- 
tempt to make comparison of the amount of improvement made on 
educational tests from October to May by pupils making identical scores 
on the National Intelligence Test in cities having less than 10,000 in- 
habitants and in cities having more than 10,000 inhabitants. From the 
large number of pupils taking the National Intelligence Tests in Octo- 
ber and the educational tests in both October and May it was possible 
to select approximately 100 pupils in each grade in the group of large 
cities whose scores on the intelligence tests were matched with the scores 
of a similar number of pupils in the group of smaller cities. In each 
grade these two groups of pupils were each differentiated on the basis 
of the scores on the intelligence test into three groups: those making 
high, average, and low scores. 
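
The sectioning step can be sketched as below. The paper does not state the cut points used, so the rule here, ranking the pupils by intelligence score and cutting the list into three nearly equal parts, is an assumption made only for illustration, and the scores are hypothetical.

```python
def split_into_sections(scores):
    """Rank scores from high to low and cut them into three nearly equal sections."""
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    first_cut = round(n / 3)
    second_cut = round(2 * n / 3)
    return {
        "high": ordered[:first_cut],
        "average": ordered[first_cut:second_cut],
        "low": ordered[second_cut:],
    }


# Hypothetical National Intelligence Test scores for nine pupils in one grade.
hypothetical_scores = [88, 61, 102, 75, 94, 70, 110, 83, 66]

sections = split_into_sections(hypothetical_scores)
print(sections)
```

Applying the same rule to the matched pupils of the large and the small cities would give sections of approximately equal size in both groups, which is the condition the comparison requires.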

Before exhibiting the tables showing the results achieved it should 
be said that all scores on the educational tests were reduced to ‘T’ scores 
after the plan outlined by McCall,* with the exception that the ‘T’ 
scores were developed from the distributions of scores for each grade 
obtained from the state-wide testing programs rather than from the 
distributions of the scores made by twelve-year-old pupils. Since the 
development of the ‘T’ scores was not on the basis of the scores made 
by twelve-year-olds, the scores between different grades are not com- 
parable, but all comparisons of scores within a grade are valid. By 
reducing the raw scores obtained on the different tests to ‘T’ scores a 
unit of improvement on any portion of the scale was made equivalent 
to a unit of improvement on any other portion of the scale. This step 
of reducing was necessary because, as will be noted later, the initial 
level of achievement in the larger cities was higher than the initial level 
of achievement in the smaller cities. Furthermore, the initial level of 
achievement in the high sections was usually higher than in other sec- 
tions. Thus it is possible to compare not only the achievement of the 


* McCall, W. A. How to Measure in Education, Ch. X, pp. 272, 307. Macmillan 
Company, New York, 1922. 

total number of pupils in each grade in the two groups of cities, but 
also the achievement of each of the three sections within each grade. 
Furthermore, it is possible to compare the amount of improvement 
made by these three sections in the large or in the small cities. 
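
The reduction of raw scores to ‘T’ scores may be sketched as follows. This is a minimal illustration of the general idea, assuming the usual form of the scale: each raw score is placed by the proportion of the reference group falling below it plus half of those reaching it, that proportion is converted to a normal deviate, and the deviate is expressed on a scale with a mean of 50 and a standard deviation of 10. The reference distribution here is hypothetical; in the study it was the state-wide distribution for each grade.

```python
from statistics import NormalDist


def t_scores(reference_scores):
    """Map each distinct raw score to a 'T' score using the reference distribution."""
    n = len(reference_scores)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    scale = {}
    for score in sorted(set(reference_scores)):
        below = sum(1 for s in reference_scores if s < score)
        equal = sum(1 for s in reference_scores if s == score)
        # Proportion below the score plus half of those exactly at it.
        proportion = (below + 0.5 * equal) / n
        scale[score] = 50 + 10 * standard_normal.inv_cdf(proportion)
    return scale


# Hypothetical raw scores for one grade (not the Michigan data).
reference = [1, 2, 3, 4, 5]
scale = t_scores(reference)
print(scale)
```

Because the scale is built from one grade's distribution at a time, a gain of one ‘T’ unit means the same amount of movement within that grade whether the pupil begins high or low, but the units of different grades are not comparable, as stated above.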

Tables III to VI, inclusive, present the amount of improvement on 
the spelling test, the comprehension scores of the Monroe Silent Read- 
ing Test, and the number of problems attempted on the Courtis Super- 
visory Tests in Arithmetic, and the number of problems correctly solved 
on the Courtis Supervisory Tests in Arithmetic, respectively. In each 
of these tables the first column indicates the particular sections into 
which the grade was divided; the second column, the number of pupils 
in each grade and section within a grade; the third and seventh columns, 
the average scores on the National Intelligence Test in each grade and 
section in the group of small and large cities; the fourth and fifth, and 
eighth and ninth columns represent the October and May ‘T’ scores on 
the educational tests in the different grades and sections of grades in 
the group of small and large cities respectively; the sixth and tenth 
columns represent the gains in ‘T’ scores on the educational tests result- 
ing from a year of instruction in the group of small and large cities, 
respectively. 

Table III, as typical of this group of tables, will be interpreted in 
detail and but slight reference will be made to the other tables as they 
merely emphasize the same points. From this table the following points 
are manifest: 

1. The matching of pupils in the two groups in the large and small 
cities on the basis of the scores on the National Intelligence Test was 
almost perfect. The matching in the sections within the two groups was 
also very close. 

2. The number of pupils in the different grades varied from 69 
in Grade III to 81 in Grade VIII. The three sections within each grade 
contained approximately the same number of pupils. 

3. The level of achievement in the cities having less than 10,000 
inhabitants was considerably lower than that in the cities having more 
than 10,000 inhabitants. This was especially true in the October results 
and was noticeable in the May results for Grades III, IV, and V. 

4. There seemed to be little relationship between the size of the 
city and the gains made in the different grades. Greater gains were 
made in the total groups in the cities of less than 10,000 inhabitants in 
Grades III, VII, and VIII, and greater gains in the cities of more than 
10,000 inhabitants in Grades IV and VI. This fact is emphasized by 
comparing the gains made by pupils in a corresponding section of a 
grade. 

For fear that someone will assert that the gains in the large and 
small cities were on different levels of achievement and therefore not 
comparable, it must be recalled that all of the original scores in this 
set of tables were reduced to ‘T’ scores and that all units of gain on 
this basis are equal regardless of the position in the distribution of 
scores in which the gain occurs. 

5. Comparison of the gross ‘T’ scores made in the different 
sections shows that the high sections almost always made higher scores 
than the middle sections and that the middle sections almost always 
made higher scores than the low sections. This suggests a positive rela- 
tionship between the level of ability manifested on the mental test and 
the level manifested on the educational tests. However, comparison of 
the gains made in the different sections of a grade in either the group 
of large or small cities shows no definite relationship between the 
median scores on the mental test and the gains manifested on the 
spelling test. In Grade III in the group of smaller cities the average, 
or middle group, showed greater gain than either the high group or 
the low group, but the low group made greater gain than the high 
group; in the group of large cities the low group made greatest gain, 
the average group next greatest gain, and the high group the least 
gain. In Grade IV in the small cities the low group made greatest 
gain, and the middle group least gain; in the group of large cities, the 
middle group made the greatest gain and the high group the least gain. 
In other grades similar conditions existed. In general this table shows 
of the five chances to achieve greatest gain, the low group in the group 
of small cities was successful twice; the middle group twice, and the 
high group, once. In the group of larger cities the low group was suc- 
cessful twice; the middle group, twice; and the high group, once. 

Table IV, showing similar facts concerning the scores and gains 
on the rate scores of the Monroe Silent Reading Test, reveals about the 
same conditions as in Table III. It is interesting to note that greater 
gains were made in all grades in the group of cities having less than 
10,000 inhabitants. Out of the five chances for making superior gains 
in the smaller group of cities, the low section was successful, once; the 
middle section, four times. In the group of larger cities the low section 
was successful twice, and the high section three times. 

Table V shows similar data on the number of exercises attempted 
on the Courtis Supervisory Tests in Arithmetic. It shows that on the 
number of problems attempted greater gains were made in all grades 
save Grade IV in the group of smaller cities. 

Comparison of the achievement in the different sections of the vari- 
ous grades shows that in the group of small cities the high group made 
greatest gain in three of the five grades considered; the middle group 
made greatest gain once; and the low group once. In the group of 
larger cities, the middle group made greatest gain twice, the low group 
once, and the high group made the least loss twice. It is rather sig- 
nificant that in Grades VII and VIII, the pupils in the larger cities 
actually attempted fewer problems in May than in October. 

Table VI, showing the data for the number of exercises correctly 
solved on the Courtis Supervisory Tests, reveals that in three of the 
grades greater gains were made in the smaller groups of cities. In the 
smaller cities, the higher sections made greater gain than the other 
sections in four of the five grades under consideration and equaled the 
gain made in the low section in the other grade. In this group of cities 
the low sections made least gain in all five grades. This is the only 
table which shows any semblance of relationship between the 
intelligence and the achievement levels. In the larger cities this 
generalization does not hold, for the middle section shows greater gains 
in two grades, the low section in one grade, and the high section a 
smaller loss in two grades. 

Other similar data are available, but sufficient have been presented 
to establish with a fair degree of certainty the following points: 

1. Widely different results were obtained from two groups of chil- 
dren perfectly matched on the basis of their scores on the National In- 
telligence Test. To state the point in another way, much variation in 
achievement existed even when intelligence as a determining factor was 
eliminated. Whether this variation would have been greater had the 
matching been on the basis of the I.Q. or the index of brightness instead 
of the raw score is not known. 

2. There was no constant relationship between the gains in achieve- 
ment of groups of the same mental level in cities having less than 10,000 
inhabitants and in cities having more than 10,000 inhabitants. In 
slightly more than half of the grades and sections the group in the 
smaller cities made more gains than the groups in larger cities. To the 
mind of the writer this condition was brought about by a universal 
attempt on the part of the smaller cities to improve their existing 
standards of achievement. General familiarity with the teaching situ- 
ation within the different cities convinces the writer that greater effort 
was put forth in the smaller cities than in the larger cities. 

3. If evidence justifies these first two points, then it justifies the 
statement that there is no substitute for good teaching. Without good 
teaching and proper adjustment of instruction, the achievement of bril- 
liant classes will be disappointing, and with good teaching, even with 
less brilliant pupils, the achievement will be surprisingly great. 

It might be said that the lack of agreement in intelligence and 
achievement has resulted from the failure to make suitable adjustments 
in instruction to the mentality of the groups of children. It might also 
be suggested that these results were due to the fact that these groups 
of children were selected at random from all of the children taking the 
tests in the large and small cities. The children were not taught by 
any particular teachers; in fact the pupils were not grouped as distinct 
classes and no attempt was made to control the conditions under which 
they were taught. Possibly there would have been different results had 
the investigation been intensive rather than extensive and had the con- 
ditions been rightly controlled. 

However, it is ventured that the conclusions would have been the 
same. Evidence substantiating this venture is offered in the fourth or 
what has been termed the main investigation. 

This fourth investigation was prompted by a desire to measure, 
if possible, the effect of ability grouping by means of experimental and 
control groups. Hitherto differentiation has been evaluated on the basis 
of experimental groups without consideration of control groups. This 
investigation, as planned, was to measure the practice of differentiation 
as actually carried on in numerous of our present-day schools; i.e., the 
practice of giving a single intelligence test and differentiating the pupils 
into two, three, or four groups solely upon the basis of the scores made. 
This investigation was carried on in seven different cities in Michigan 
during the first semester of the 1923-24 school year in connection with 
the instruction in English and algebra in Grade IX. In all, almost 1,000 
pupils just entering high school participated in this investigation, but 
the present report will deal with but 51 pairs of pupils in a single city 
differentiated for instructional purposes in English. The results from 
other pupils would be included, but the tabulations have not been com- 
pleted. 

The general plan of the investigation was to differentiate the pupils 
just entering high school on the basis of the Terman Group Test of 
Mental Ability, Form A, first of all into experimental and control groups 
and then to subdivide each of these groups for instructional purposes. 
The method followed in making the differentiation into experimental 
and control groups was to place the pupil making the highest score on 
the Terman Test in the experimental group; the one making the next 
highest score in the control group; the one making the next highest 
score in the control group; the one making the next highest score in 
the experimental group, etc. By the utilization of this method of dif- 
ferentiation the two groups of pupils were almost identical so far as 
scores on the mental tests were concerned. As stated before, no other 
factors than the scores on the mental tests were considered in the 
differentiation because this practice is being carried on extensively in 
many of the public schools, and there was a desire to evaluate the 
practice scientifically. 
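
The alternating assignment just described may be sketched as follows. This is an illustration only: the pupil labels and scores are hypothetical, and it is assumed that the "etc." of the description means the order experimental, control, control, experimental simply repeats down the ranked list.

```python
def alternate_assign(scores):
    """Assign pupils to groups in the repeating order E, C, C, E.

    `scores` maps a pupil label to a mental-test score. Pupils are
    ranked from highest to lowest score before assignment.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    pattern = ["experimental", "control", "control", "experimental"]
    groups = {"experimental": [], "control": []}
    for rank, (pupil, _score) in enumerate(ranked):
        groups[pattern[rank % 4]].append(pupil)
    return groups


# Hypothetical scores (not data from the study).
scores = {"A": 130, "B": 124, "C": 119, "D": 110,
          "E": 101, "F": 96, "G": 90, "H": 84}
groups = alternate_assign(scores)
exp_mean = sum(scores[p] for p in groups["experimental"]) / 4
ctl_mean = sum(scores[p] for p in groups["control"]) / 4
```

With these hypothetical scores the two group averages differ by a single point, which illustrates why the order E, C, C, E keeps the groups closer than a plain alternation E, C, E, C would, since the latter always gives the higher score of each pair to the same group.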

In the subdivision of these two main groups into smaller groups 
for instructional purposes, the experimental group was divided into 
either two or three sections on the basis of the scores on the mental 
test. Those making high scores were placed in one section, and those 
making low scores were placed in another section. If the number of 
pupils was sufficiently large a third section was formed in which the 
pupils from the middle of the distribution of scores were placed. The 
control group was divided for instructional purposes into the same num- 
ber of sections as the experimental group, but in the differentiation of 
the control group the principle of random selection was involved, and 
absolutely no attention was paid to the scores on the mental test. Thus 
in each section of the control group there was an approximately equal 
number of pupils making high, low, and medium scores on the mental 
test. This latter method of differentiation is not far different from that 
utilized by many school systems in making up the sections from the 
large number of pupils just entering high school. 
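
The two methods of forming sections may be contrasted in a short sketch. It rests on assumptions not stated in the report: sections are made as nearly equal in size as possible, Section I is taken as the low section, and the random division is made by shuffling with a fixed seed. The function names are hypothetical.

```python
import random


def sections_by_score(pupils, scores, n_sections):
    """Experimental group: contiguous sections from low to high score."""
    ranked = sorted(pupils, key=lambda p: scores[p])
    size, extra = divmod(len(ranked), n_sections)
    sections, start = [], 0
    for i in range(n_sections):
        end = start + size + (1 if i < extra else 0)
        sections.append(ranked[start:end])
        start = end
    return sections


def sections_at_random(pupils, n_sections, seed=0):
    """Control group: sections formed without regard to score."""
    shuffled = list(pupils)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_sections] for i in range(n_sections)]


# Hypothetical pupils and scores (not data from the study).
scores = {"A": 60, "B": 72, "C": 85, "D": 91, "E": 104, "F": 120}
by_score = sections_by_score(list(scores), scores, 3)
by_chance = sections_at_random(list(scores), 3)
```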

In each of the different cities the same teacher taught both the 
experimental and the control sections, and thus the teacher variable 
was eliminated so far as teaching ability was concerned. All teachers 
were told to push the various sections as fast as possible by giving 
them all they could do in the regular texts or by assigning plenty of 
supplementary materials. The teachers were told to exercise their own 
judgment with regard to the rate and amount of work to be done, but 
were urged to move at a rate consistent with effective teaching. 

The effectiveness of the plan of differentiation was measured by 
means of standardized educational tests given at the beginning and end 
of the semester. In the classes in English the Van Wagenen Reading 
Scale in English Literature, the Briggs English Form Test, and the 
Kirby Grammar Test were given both times. In the algebra classes 
the Hotz Algebra Scales, Series A (Equation and Formula, and Problem 
Scales) were given at the end of the first semester. Since the 
pupils had not had algebra it was not deemed necessary to give these 
tests at the beginning of the semester. In addition to these standardized 
tests the teachers gave frequent written examinations. These 
examinations, constructed to be as objective as possible, were given to 
all pupils, and a very complete record was filed. The teachers made 
careful observations of the daily recitations, the amount of 
subject-matter covered, and of the outstanding characteristics of the 
different sections. In fact, the teachers did everything possible to give 
this method of differentiation a fair chance and a fair evaluation. 

TABLE VII. THE ACHIEVEMENTS OF THE EXPERIMENTAL AND CONTROL GROUPS

                                        Total        Group        Group        Group
                                       Groups            I           II          III
AGE
  Experimental                           14.1         14.2         14.2         13.9
  Control                                14.1         14.4         14.0         13.9
  Superiority of Experimental               0          -.2           .2            0
TERMAN INTELLIGENCE
  Experimental                           94.6         68.4         90.8        124.5
  Control                                92.2         69.9         89.5        117.2
  Superiority of Experimental             2.4         -1.5          1.3          7.3
VAN WAGENEN
  First Test
    Experimental                         74.4         63.6         75.6         83.9
    Control                              74.7         69.0         74.8         80.3
    Superiority of Experimental           -.3         -5.4           .8          3.6
  Second Test
    Experimental                         80.9         71.1         81.5         90.1
    Control                              80.2         73.3         80.8         86.4
    Superiority of Experimental            .7         -2.2           .7          3.7
  Gains
    Experimental                          6.5          7.5          5.8          6.2
    Control                               5.5          4.3          6.2          6.1
    Superiority of Experimental           1.0          3.2          -.4           .1
BRIGGS
  First Test
    Experimental                         22.4  [illegible]         20.5         25.4
    Control                              22.3  [illegible]         21.7         23.8
    Superiority of Experimental            .1           .1         -1.2          1.6
  Second Test
    Experimental                         27.7         25.9         26.3         30.9
    Control                              26.1         25.6         25.1         27.4
    Superiority of Experimental           1.6           .3          1.2          3.5
  Gain
    Experimental                  [illegible]          4.6          5.8          5.6
    Control                               3.8          4.4  [illegible]  [illegible]
    Superiority of Experimental           1.5  [illegible]  [illegible]  [illegible]
KIRBY PRINCIPLES
  First Test
    Experimental                         17.9         12.1         20.6         20.9
    Control                              13.2         12.7         13.1         14.7
    Superiority of Experimental           4.7  [illegible]          7.5          6.2
  Second Test
    Experimental                         25.5         20.4         25.4         30.6
    Control                       [illegible]         19.7         22.1         28.1
    Superiority of Experimental           3.3           .7          3.3          2.5
  Gain
    Experimental                  [illegible]  [illegible]          4.8          9.6
    Control                       [illegible]  [illegible]          9.0         13.4
    Superiority of Experimental          -2.8          -.4         -4.2         -3.8
KIRBY SENTENCES
  First Test
    Experimental                         29.2         25.5         29.6         32.5
    Control                              23.5         23.8         23.1         23.8
    Superiority of Experimental           5.7  [illegible]  [illegible]  [illegible]
  Second Test
    Experimental                  [illegible]  [illegible]         32.4         34.9
    Control                       [illegible]         30.0         31.2         32.6
    Superiority of Experimental   [illegible]  [illegible]  [illegible]          2.3
  Gain
    Experimental                  [illegible]  [illegible]  [illegible]          2.4
    Control                       [illegible]  [illegible]  [illegible]          9.0
    Superiority of Experimental          -4.8         -2.4         -5.4         -6.6

In making the final tabulations, the records of only those pupils 
who had taken the required tests and who had been regular in attend- 
ance during the semester were considered. It need hardly be said that 
the final matchings of the sections of the experimental and control 
groups were made after the records for the first semester had been 
handed in. At that time each pupil having a particular age and in- 
telligence score within each section of the experimental group was paired 
with a pupil having a similar age and intelligence score within the 
control group. In this way age and intelligence as factors were elim- 
inated and the influence of differentiation was evaluated in terms of 
the achievements on the standardized educational tests or in terms of 
the teachers’ judgments of the nature of the work done. 
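
The pairing of pupils may be sketched as follows. The report does not state how close two pupils had to be in age and intelligence score to count as "similar", so the tolerances below (half a year and five points) are assumptions made purely for illustration, as are the function name, the greedy order of pairing, and the sample pupils. The sketch also ignores the restriction of matching to corresponding sections.

```python
def match_pairs(experimental, control, age_tol=0.5, score_tol=5):
    """Pair each experimental pupil with the closest unused control pupil.

    Each pupil is a tuple (label, age, mental-test score). A control
    pupil qualifies only if both age and score fall within the assumed
    tolerances; pupils without a qualifying partner are left unpaired.
    """
    unused = list(control)
    pairs = []
    for label, age, score in experimental:
        candidates = [c for c in unused
                      if abs(c[1] - age) <= age_tol
                      and abs(c[2] - score) <= score_tol]
        if not candidates:
            continue
        best = min(candidates,
                   key=lambda c: (abs(c[2] - score), abs(c[1] - age)))
        unused.remove(best)
        pairs.append((label, best[0]))
    return pairs


# Hypothetical pupils (not data from the study).
experimental = [("e1", 14.0, 100), ("e2", 13.5, 120), ("e3", 15.0, 70)]
control = [("c1", 14.2, 102), ("c2", 13.6, 118), ("c3", 14.0, 95)]
pairs = match_pairs(experimental, control)
```

In this hypothetical case pupil e3 remains unpaired and would be dropped from the tabulation, which is one way in which almost 1,000 pupils could reduce to a much smaller number of matched pairs.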

Table VII presents comparison of the ages, scores on the mental 
test, the averages of the initial and final scores on the Van Wagenen, 
Briggs, and Kirby tests, a statement of the gains made on each of the 
tests, and a statement of the superiority of the gain of the experimental 
group over that of the control group. From this table a number of 
very interesting facts are evident. 

1. The average ages of the experimental and control group were 
identical. In both groups the average of Section I was greater than 
that in Section II and that in Section II was greater than in Section III. 
This means that the younger pupils were the more brilliant. 

2. The average score on the Terman Mental Test was approxi- 
mately two points higher for the experimental group than for the con- 
trol group. In the different sections the average score on the mental 
test was slightly lower in Section I and slightly higher in Sections II 
and III in the experimental group than in the control group. However, 
the differences in these scores were so small that they are of no 
consequence. 

3. On the educational tests, almost without exception, on both the 
first and second tests the average score for Section I was lower than 
the average score for Section II, and the average score for Section II 
was lower in turn than that for Section III. This fact is interesting, 
but it is in no way a proof that differentiation produces superior re- 
sults to non-differentiation, as is often indicated in current periodical 
literature. 

4. On the first tests, even tho the experimental and control groups 
were identical in averages of ages and mental scores, the experimental 
group was slightly inferior on the Van Wagenen Test, slightly superior 
on the Briggs Test, and considerably superior on the two phases of the 
Kirby Test. This indicates that children of the same age with the same 
scores on the mental tests may have different scores on the achievement 
tests. Within the different sections of the two groups, the sections of 
the experimental groups had slightly superior scores save in two grades 
in which the control groups were somewhat superior. 

5. In the gains made during the semester the experimental group 
as a whole was somewhat superior on the Van Wagenen Test and the 
Briggs Test, but the control group was superior on both phases of the 
Kirby Test. In the different sections, on the Van Wagenen Test the 
experimental group was superior in the amount of gain made in Sections I 
and III and the control group in Section II; on the Briggs Test, the 
experimental group was superior in all sections in the amount of gain; 
on both aspects of the Kirby Test the control group was superior in the 
amount of gain made in all three groups. Altogether out of the 12 
different chances given for superiority in the amount of gain made, the 
experimental 
group or sections thereof were superior 7 times and the control group 
5 times. Thus on this basis of tabulation there was no outstanding 
advantage for either group or for similar sections within the groups. 
It should be pointed out, however, that this generalization is made on 
the basis of the gains computed from raw scores instead of ‘T’ scores. 
Whether the reduction of the raw scores to ‘T’ scores would have made 
any difference is not known, but when the data from the 7 cities are 
tabulated this reduction will be made. 
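
The arithmetic behind the "Gains" and "Superiority of Experimental" rows of Table VII may be checked with a short sketch. The figures used are the Total Groups averages on the Van Wagenen test as given in the table; the function name is hypothetical, and rounding to one decimal place is assumed to match the precision of the table.

```python
def gain(first, second):
    """Gain in average raw score from the first test to the second."""
    return round(second - first, 1)


# Van Wagenen averages for the total groups, taken from Table VII.
experimental_gain = gain(74.4, 80.9)
control_gain = gain(74.7, 80.2)

# Positive values favor the experimental (differentiated) group.
superiority = round(experimental_gain - control_gain, 1)
```

These are gains in raw scores, so, as noted above, a point gained at one level of the scale is not necessarily equivalent to a point gained at another.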

Table VIII provides corroboratory evidence. This table is based 
upon the records and observations made by the teacher of these pupils. 
This table exhibits distributions of the semester marks given to the 
pupils in the experimental and control groups as wholes and by sec- 
tions. It also shows the distribution of the marks given each group or 
section on daily recitations, book reports, monthly tests, written themes, 
and spelling. These enumerated items were all component items upon 
which the semester grades were computed. 


There are two outstanding characteristics of Table VIII: 

1. On each item the marks obtained in both groups by the pupils 
of Section I were lower than those obtained by the pupils of Section II 
and the marks obtained in Section II were in turn lower than those 
obtained in Section III. The situation with regard to the marks ob- 
tained is similar to that with regard to the scores received on the edu- 
cational tests, but here again it must be insisted that this condition does 
not of itself justify differentiation. 

2. On each item distributions of marks for the experimental and 
the control groups or for comparable sections of each of these groups 
were almost identical. The central tendencies of different distributions 
show virtually no difference on the achievements of the two groups. On 
some items the experimental group manifested a slight superiority in 
the marks obtained, but such superiority was neutralized by the superior- 
ity of the control group on some other item. These findings were almost 
dumbfounding to the teacher of these pupils for she was especially 
pleased with the work of differentiated groups and had reported that 
the accomplishment in the differentiated sections of the experimental 
groups was very superior to the work in the sections in the control 
group. She had been fooled by the lack of comparative evaluation of 
the achievements of the brilliant and slow pupils who were in the same 
sections of the control group, and it is reasonable to think that the 
level of achievement in these sections would not be so high as in the 
high section of the experimental group. However, when the accom- 
plishments of the pupils in the high sections of the experimental group 
were matched with the accomplishments of pupils having the same age 
and intelligence scores, the difference in achievements vanishes and it 
seems that the apparent differences were not real. 

Table IX presents a comparison of the accomplishments of the ex- 
perimental and control groups in a slightly different manner from that 
reported in Tables VII and VIII. In this table a record is given which 
shows the number of pairs of the matched pupils in which the pupil in 
the experimental group or section is superior to the pupil in the control 
group or section. 

TABLE IX. NUMBER OF PAIRS IN WHICH PUPILS IN EXPERIMENTAL GROUP WERE 
SUPERIOR TO PUPILS IN CONTROL GROUP IN GAINS MADE AND ON TEACHER’S 
JUDGMENT 

GAINS ON                      Group I     Group II    Group III        Total
Van Wagenen                         9            7  [illegible]           25
Briggs                             10           13  [illegible]  [illegible]
Kirby:
  Principles                       11            4            5           20
  Sentences                         6            3            3           12
Semester Grades                     4            9            4           17
Average Daily Grades                3            9            8  [illegible]
Monthly Tests                       4            9            7           20
Book Reports                        6            6  [illegible]  [illegible]
Themes                              5            8  [illegible]  [illegible]
Spelling                  [illegible]  [illegible]  [illegible]           16
Number of Pairs                    17  [illegible]  [illegible]           51

Since there are 51 pairs of pupils under consideration, the 
experimental group must be superior in 26 pairs to have been superior in 
more than half of the cases. Observation of the last vertical column 
of the figures in Table IX reveals that only on the Briggs Test did the 
pupil in the experimental group excel the pupil in the control group 
in more than half of the pairs of matched pupils. This means that so 
far as the total groups are concerned more than half of the pupils did 
better when differentiation on the basis of intelligence tests was not 
applied. Examination of the reports for the different sections reveals 
that in Section I (the low section) on the Van Wagenen, Briggs, and 
the principles aspect of the Kirby Test, the pupils in the experimental 
group were superior to the pupils in the control group in more than 
half of the pairs, but on the items involving the teacher’s judgment 
the pupils in the control group were superior in considerably more than 
half of the cases. In Section II the experimental group was superior 
in more than half of the pairs on those items involving teacher’s 
judgment but was inferior in more than half of the cases on the Van 
Wagenen and Kirby Tests. In Section III the pupils in the 
experimental group were superior in more than half of the pairs on the Van 
Wagenen and Briggs Tests and on the item “book reports” and inferior 
in more than half of the pairs on the other items. In general, out of 
the 12 chances on the educational tests the pupils in sections of the 
experimental group obtained superior results in more than half of the 
pairs, six times; out of the 18 chances on the items involving the 
teacher’s judgment the pupils in the sections of the experimental group 
were superior in more than half of the pairs, only four times. Thus on the 
basis of the results of the educational tests there was no outstanding 
advantage with either group, but on the basis of those items upon which 
the teacher’s judgment was involved there was an advantage in favor 
of the control group. Whether these same conclusions will be reached 
when the data from the other cities participating in the investigation 
are tabulated is not known. 
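
The counting rule applied to Table IX may be sketched as follows. The function names and the sample gains are hypothetical, and the sketch assumes that a tied pair is not credited to the experimental group, a point on which the report is silent.

```python
def majority_threshold(n_pairs):
    """Smallest number of pairs that is more than half of n_pairs."""
    return n_pairs // 2 + 1


def experimental_wins(pair_gains):
    """Count pairs in which the experimental pupil gained more.

    `pair_gains` is a list of (experimental gain, control gain) tuples;
    ties are not counted for the experimental group (an assumption).
    """
    return sum(1 for exp, ctl in pair_gains if exp > ctl)


needed = majority_threshold(51)

# Hypothetical gains for five matched pairs (not data from the study).
sample = [(6, 4), (3, 5), (7, 7), (2, 1), (0, 3)]
wins = experimental_wins(sample)
sample_majority = wins >= majority_threshold(len(sample))

# The Van Wagenen total of 25 pairs in Table IX falls short of 26.
van_wagenen_majority = 25 >= needed
```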

It is to be regretted at this time that a definite record of the actual 
amount of subject-matter covered by the different groups and sections 
is not available to the writer, altho he was assured that there was little 
difference in the amount of subject-matter covered and that the out- 
standing difference between the work of the sections was that the dif- 
ferentiated sections of the experimental group were more easily taught, 
and that they had a higher degree of mastery of the subject-matter. 
The evidence given throws considerable doubt on the validity of this 
last assertion, but there is no evidence at hand to cast suspicion on the 
assertion that the differentiated sections of the experimental group were 
more easily taught than the sections of the control group. If this point 
is established a case is made for differentiation on the basis of the 
scores on a mental test regardless of whether superior results are 
obtained. 


GENERAL CONCLUSIONS 


To the firm believer in individual differences, the facts presented 
in the third and fourth investigations described in this report present 
a challenge and suggest that investigations are needed to determine 
whether the tests and judgments rendered in this investigation are ade- 
quate for measuring the effects of the practice of differentiation on the 
basis of the scores on a single mental test. Probably what is more im- 
portant is that investigations are needed to determine just how to adjust 
the curriculum to the different groups after differentiation has taken 
place. In all probability the failure to make proper adjustments in 
materials and methods of teaching is responsible for the results obtained 
in the third and fourth investigations. Thus the failure to find that 
groups differentiated into sections of like ability make greater gains as 
sections or as individuals is not so much a condemnation of the practice 
of differentiation on the basis of a mental test, or for that matter on 
the basis of several tests, past records, and teachers’ judgments, as it 
is a condemnation of what took place after the differentiation. This 
suggests that the problem for the future is not the condemnation of the 
practice of differentiation according to ability, but is the baffling and 
perplexing task of how to readjust the curriculum and methods of teach- 
ing so the possibilities offered by differentiation may be fully realized. 











Results from Successive Repetitions of Certain 
Arithmetic Tests 


CLIFFORD WOODY, Director of Bureau of Educational Reference and 
Research, University of Michigan 





INTRODUCTION 


THERE are four stages in the final acceptance of any great move- 
ment: (1) the period of reluctant acceptance in which much hesitation, 
doubt, and even ridicule is manifested; (2) the period of enthusiastic 
acceptance which is characterized by blind and uncritical indulgences; 
(3) the period of partial rejection which results as a natural reaction 
to the over-enthusiasm and blind acceptance of the former period; (4) the 
period of critical analysis and evaluation in which the real worth of the 
movement is determined. 

The movement for measuring has passed thru all of these stages. 
About thirty years ago Rice was greatly ridiculed when he proposed at 
the meeting of the National Education Association that he could deter- 
mine the efficiency of the teaching of spelling by ascertaining how well 
the pupils could spell. He was the first to utilize extensively the com- 
parative test. But the idea gained little headway before 1915. From 
1915 to 1922 was a period of enthusiastic acceptance as is evidenced by 
the fact that the number of standardized tests increased from scarcely 
a dozen to approximately 300, if the number listed in the advertising 
material distributed by one of our large publishing houses is reliable. 
After this period of enthusiastic acceptance and at times blind and 
uncritical utilization of the tests, came the period of partial rejection. 
Three years ago at the Chicago meeting of the National Education 
Association numerous papers were read urging the intelligent use of 
the tests and pointing out the pitfalls and abuses observed in the current 
practices with the tests. Many people left that meeting with the 
conviction that the testing movement was a failure. But the staunch 
supporters of the movement did not lose faith. They left cognizant of the 
fact that the years ahead were to be years of critical examination and 
analysis. Certainly the outstanding characteristic of the testing work 
of the past 2 years has been the critical examination, analysis, and evalu- 
ation of tests and testing technique. In this period many investigations 
of the reliability and validity of tests, or of the amount of error in- 
volved in experimental and statistical techniques, have been made. 

The present study dealing with arithmetic tests was undertaken to 
aid in the critical analysis and evaluation of tests and testing technique. 
The investigation was attempted to ascertain the following facts: 

1. The improvement resulting from the repetition on 4 successive 
days of the Woody-McCall Test in Mixed Fundamentals of Arithmetic, 
the Courtis Supervisory Tests in Arithmetic, and the Courtis Research 
Tests in Arithmetic, Series B. 

2. The relative difficulty of the different forms of each of these 
tests. 

3. The reliability of the different tests. 

4. The relationship of success on one of the tests to success on the 
other tests. 

5. The reliability of the “Educational Quotient” technique. 


THE EXPERIMENTAL TECHNIQUE 


Brief Description of the Tests. The Woody-McCall Tests in the 
Mixed Fundamentals of Arithmetic consist of 35 exercises in each of 
the fundamental operations of arithmetic. These exercises, chosen from 
the various types of exercises in addition, subtraction, multiplication, 
and division, form a cross-section of the course of study in arithmetic 
and are arranged so that the processes are “scrambled” and so that the 
exercises are arranged according to difficulty. The time allowed the 
children in taking the test is 20 minutes, but the time element is not an 
essential characteristic of the test since most of the children finish 
before the expiration of the time allowed. This test is thus a “difficulty” 
test or what is sometimes known as a “power” test, altho this latter 
name usually is applied to tests involving reasoning problems. 

The Courtis Supervisory Tests in Arithmetic are divided into 2 
tests: Test A for Grades IV and VB, and Test B for Grades VA to 
VIIIA, inclusive. The 2 tests are similar in construction save that the 
exercises in Test B are more difficult. Test B is made up of 25 exer- 
cises, distributed among the fundamental processes as follows: 


8 exercises in addition involving the adding of 5 two-place numbers. 

8 exercises in subtraction involving the taking of a four-place num- 
ber from a four- or five-place number. 

5 exercises involving the multiplication of a two-place number by 
a two-place number.

4 exercises in division involving the division of a four-place number 
by a two-place number. 


The time allowed the children for taking Test B is 6½ minutes in
Grade VA and 6 minutes in Grade VIA. It should be added that the 
time allowed is different for each grade and half-grade with either 
Test A or Test B. Consequently it is clear that this test is a “time” 
test. 

The Courtis Research Test in Arithmetic, Series B, is a test in 
each of the 4 fundamental processes of arithmetic, but only the tests 
in addition and multiplication were used in this experiment. The test 
in addition consists of adding in 8 minutes as many as possible of 24 
exercises each containing 9 three-place numbers. The test in multiplica- 
tion consists of multiplying in 6 minutes as many as possible of the 24 
exercises, each containing a four-place number to be multiplied by a 
two-place number. Since there are many more problems in each of 
these tests than the children can solve in the time allowed, the tests 
are rightly termed “time” tests. 

There are 4 forms of each of these 3 tests and these different forms 
are supposed to be equal in difficulty; at least the different forms have 
been equated on the basis of masses of statistical data. One of the
purposes of this investigation was to determine whether the forms were 
equal under conditions which were definitely controlled. 

Method of Giving the Tests. In this investigation some forms of 
each of the 3 previously-mentioned tests were given on 4 successive days 
in one of the public schools in Detroit. The Courtis Supervisory and 
the Woody-McCall tests were given in succession on Tuesday, Wednesday,
Thursday, and Friday of the last week in April, 1923, and the
addition and multiplication tests of the Courtis Research Test, Series B,
were given on the same days of the first week of May. The tests of
each series on the first day were given during the first hour after school 
opened; the tests of the second day, during the second period; of the 
third day, during the third period; and of the fourth day, during the 
fourth period. This variation in the time of giving the tests was not 
desired in the original planning of the investigation, but was a con- 
cession to the objection that the giving of the tests at the same hour 
during the week interfered too much with the regular routine of the 
school. The difference in time of the giving of the tests introduced a 
slight variable, but as will be seen in a moment, it will have no influ- 
ence on the major part of the investigation. The extent of its influence 
on the other parts of the investigation is unknown, but probably it is 
small. 

In order to distribute equally the influence of the learning resulting 
from the taking of the tests on 4 successive days over the 4 forms of 
the test and to neutralize any possible effect which might accrue from 
differences in the equality of the different forms of the tests, a revolving 
scheme of administering the tests was followed. Each grade in which 
the tests were given was divided into 4 groups of pupils who took the 
different forms of the tests as indicated in Table I. 


TABLE I.—REVOLVING SCHEME FOR ADMINISTERING THE TESTS

                  FORMS USED ON THE INDICATED DAYS
GROUP       FIRST DAY   SECOND DAY   THIRD DAY   FOURTH DAY

I               1            2           3            4
II              2            3           4            1
III             3            4           1            2
IV              4            1           2            3


This scheme makes it evident, on the one hand, that any improve- 
ment manifested thru the successive repetition of the tests is a result 
of learning and not due to the difference in the difficulty of the forms 
of the test and, on the other hand, that any difference manifested on 
the total score on the different forms of the various tests is due to the 
lack of equality in the forms of the tests and not due to improvement 
resulting from practice in taking the tests. 
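
The rotation of Table I is a cyclic arrangement in which each form is met once by every group and once on every day. The following is a minimal sketch in Python, offered only as an illustration; the names `revolving_schedule` and `is_balanced` are hypothetical and form no part of the original investigation. It generates the scheme and checks that balance.

```python
def revolving_schedule(n_forms=4):
    """Return {group: [form on day 1, form on day 2, ...]} for a cyclic rotation.

    Groups and forms are numbered from 1, as in Table I.
    """
    return {
        group: [(group - 1 + day) % n_forms + 1 for day in range(n_forms)]
        for group in range(1, n_forms + 1)
    }


def is_balanced(schedule):
    """True if every form occurs once per group and once per day."""
    n = len(schedule)
    all_forms = set(range(1, n + 1))
    rows_ok = all(set(forms) == all_forms for forms in schedule.values())
    days_ok = all(
        {forms[day] for forms in schedule.values()} == all_forms
        for day in range(n)
    )
    return rows_ok and days_ok


if __name__ == "__main__":
    schedule = revolving_schedule()
    for group, forms in schedule.items():
        print(group, forms)
    print("balanced:", is_balanced(schedule))
```

With 4 forms the sketch reproduces the assignments of Table I. Because every form falls once on each day, a mean taken by days is not disturbed by differences among the forms, and a mean taken by forms is not disturbed by practice, provided the 4 groups are comparable.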

All tests were given and scored, and all tabulations were made 
according to regular directions provided with the tests, with the one
exception that the sum of the score of the tests in addition and multi- 
plication was taken as the final score on the Courtis Research Tests 
instead of considering scores for each test separately. 

In making the tabulations only those pupils who had taken each of 
the 4 forms of the 3 different tests were considered. The investigation 
began with 200 children about equally distributed among Grades IVA,
VA, and VIA, but it ended with only 123 children distributed as follows:
In Grade IVA, 40 pupils; in Grade VA, 39 pupils; and in Grade
VIA, 44 pupils. While these numbers are rather small, they represent
all of the pupils in these grades from at least 2 rooms who took all of
the different tests. 


RESULTS 


Improvement Resulting from the Repetition of the Tests. Tables II, 
III, and IV show the amount of improvement resulting from the repeti- 
tion on 4 successive days of the Woody-McCall Tests in Mixed Funda- 
mentals, the Courtis Supervisory Tests, and the Courtis Research Tests, 
respectively. In these tables the scores made in each of the 3 grades on 


TABLE II.—IMPROVEMENT RESULTING FROM THE REPETITION
OF THE WOODY-McCALL TESTS IN MIXED FUNDAMENTALS
ON FOUR SUCCESSIVE DAYS

                            FIRST   SECOND   THIRD   FOURTH   MEAN
                             DAY     DAY      DAY     DAY     SCORE

GRADE IV A (40 pupils)
  Mean Score                 15.1    15.1     16.0    15.7    15.5
  Per cent                   100     100      106     104     103
GRADE V A (39 pupils)
  Mean Score                 21.2    22.3     23.7    23.3    22.6
  Per cent                   100     105      112     110     107
GRADE VI A (44 pupils)
  Mean Score                 23.5    23.5     25.6    23.0    23.9
  Per cent                   100     100      109      98     102
ALL GRADES (123 pupils)
  Mean Score                 20.0    20.4     21.9    20.7    20.8
  Per cent                   100     102      110     104     104


each of the 4 days and the average mean scores for the 4 days are pre- 
sented. The score made on the first day is considered as the base, i.e., 
100 per cent, and the gain resulting from the repetition of the tests is 

TABLE III.—IMPROVEMENT RESULTING FROM THE REPETITION
OF THE COURTIS SUPERVISORY TEST IN ARITHMETIC ON
FOUR SUCCESSIVE DAYS

                            FIRST   SECOND   THIRD   FOURTH   MEAN
                             DAY     DAY      DAY     DAY     SCORE

GRADE IV A (40 pupils)
  Mean Score                  9.9    11.6     12.0    10.7    11.1
  Per cent                   100     117      121     108     112
GRADE V A (39 pupils)
  Mean Score                 15.8    17.1     18.1    17.9    17.2
  Per cent                   100     108      115     113     109
GRADE VI A (44 pupils)
  Mean Score                 17.2    17.0     17.5    17.9    17.4
  Per cent                   100      99      102     104     101
ALL GRADES (123 pupils)
  Mean Score                 14.4    15.3     15.9    15.6    15.3
  Per cent                   100     106      110     108     106

ascertained by expressing scores made on succeeding days in terms of 
this base. The bottom section of each of these tables represents the 
mean for the 123 pupils, regardless of the particular grades in which 
they happen to be. 
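
The computation behind the "Per cent" rows and the bottom section of these tables can be sketched as follows. This is an illustration only: the function names are hypothetical, the figures are the Courtis Supervisory means reported in Table III, and it is assumed here that the mean for the 123 pupils equals the grade means weighted by the number of pupils in each grade.

```python
def percent_of_first_day(scores):
    """Express each day's score as a percentage of the first day's score."""
    base = scores[0]
    return [round(100 * score / base) for score in scores]


def pooled_mean(grade_means, grade_sizes):
    """Mean for all pupils, weighting each grade mean by its number of pupils."""
    total = sum(mean * size for mean, size in zip(grade_means, grade_sizes))
    return total / sum(grade_sizes)


if __name__ == "__main__":
    # Means for the 123 pupils on the Courtis Supervisory Test (Table III).
    all_pupils = [14.4, 15.3, 15.9, 15.6]
    print(percent_of_first_day(all_pupils))

    # First-day means for Grades IVA, VA, and VIA, with 40, 39, and 44 pupils.
    print(round(pooled_mean([9.9, 15.8, 17.2], [40, 39, 44]), 1))
```

Under these assumptions the first call returns 100, 106, 110, and 108, and the second returns 14.4, in agreement with the bottom section of Table III.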


TABLE IV.—IMPROVEMENT RESULTING FROM THE REPETITION
OF THE COURTIS RESEARCH TEST IN ARITHMETIC,
SERIES B, ON FOUR SUCCESSIVE DAYS

                            FIRST   SECOND   THIRD   FOURTH   MEAN
                             DAY     DAY      DAY     DAY     SCORE

GRADE IV A (40 pupils)
  Mean Score                  4.2     5.3      5.9     5.3     5.2
  Per cent                   100     126      140     126     123
GRADE V A (39 pupils)
  Mean Score                  7.3     9.4     10.8    11.3     9.7
  Per cent                   100     129      148     155     133
GRADE VI A (44 pupils)
  Mean Score                  9.1    11.8     11.9     9.5    10.6
  Per cent                   100     130      131     104     116
ALL GRADES (123 pupils)
  Mean Score                  6.9     8.9      9.6     8.7     8.6
  Per cent                   100     129      139     126     124

Table V is a summary of the median scores obtained from this
group of 123 pupils. As in Tables II, III, and IV, the scores of the 
successive days are expressed in percentages of the first day’s score. 
The median scores in this table correspond to the mean score given in 
the bottom section of the three tables previously mentioned. 


TABLE V.—SUMMARY OF THE IMPROVEMENT RESULTING FROM
THE REPETITION OF THE DIFFERENT TESTS IN ARITHMETIC
ON FOUR SUCCESSIVE DAYS

                            FIRST   SECOND   THIRD   FOURTH   MEAN
                             DAY     DAY      DAY     DAY     SCORE

WOODY-McCALL (123 pupils)
  Median Score               19.9    20.7     21.4    20.3    20.5
  Per cent                   100     104      108     102     103
COURTIS SUPERVISORY (123 pupils)
  Median Score               15.7    16.1     17.1    16.1    16.2
  Per cent                   100     103      109     103     104
COURTIS RESEARCH (123 pupils)
  Median Score                7.2     9.1      9.3     7.9     8.5
  Per cent                   100     128      131     111     118

The most outstanding generalizations to be drawn from Tables II 
to V, inclusive, are as follows: 


1. The gain computed on average scores from 4 successive repeti- 
tions of the Woody-McCall Test was approximately 4 per cent; of the 
Courtis Supervisory Tests, approximately 8 per cent; of the Courtis 
Research Test, approximately 25 per cent. The gain based upon the 
median instead of the average scores was 2 per cent on the Woody- 
McCall, 3 per cent on the Courtis Supervisory Test, and 11 per cent on 
the Courtis Research Tests. It is thus noted that the gain computed 
on the basis of the average scores is much greater than that computed 
on the basis of the median scores. 

2. Greater gains were manifested on the “speed” tests than on the 
“power” tests, i.e., greater gains on the 2 Courtis tests than on the Woody- 
McCall Test. Gates found similar results in his investigation of the 
amount of improvement accruing from successive repetitions of various 
types of reading tests.* These facts may be in keeping with the asser- 
tion by Courtis that tests without time pressures are in strict reality 
intelligence tests. 

3. It is interesting to note that in each of the tests superior results 
were obtained on the third day. The principal of the school in which 
the tests were given asserted that the children had grown a little tired 


* Gates, A. I. “Study of Comprehension in Reading by Means of a Practice Experiment.”
Journal of Educational Research, VII (1923), pp. 37-50.

of taking the same kind of test for 4 days and that this was the cause
of the lower scores on the fourth day. The author offers no explana- 
tion for the fact but is inclined to look upon it as a mere coincidence. 
The above facts make it clear that in repeating various forms of tests 
mere repetition of tests will be responsible for a certain amount of gain. 
This gain no doubt represents some gain in the actual ability to solve 
the exercises in arithmetic and some gain resulting from an improve- 
ment of technique in taking the tests. The latter type of gain does not 
represent real gain in the ability to solve exercises in arithmetic and 
if possible should be isolated from the other type. It is highly probable 
that the gain manifested in some school systems in which one form of 
a test is given at the beginning of a semester or some other period of 
training and another form given at the end of the semester or at a 
specially designated time resulted from improvement in the technique 
of taking the tests rather than in the real ability measured by the test. 
Undoubtedly the amount of gain from the repetition of tests at the end 
of a semester or of a year would be smaller than that manifested in 
this investigation. It is highly probable that the gain made by the 
pupils in this investigation was smaller than it would be in most schools
because these pupils are thoroly accustomed to the taking of tests and 
are well-acquainted with testing technique. 

Tables VI, VII, and VIII show the results achieved on the different 
forms of the test. These tables are similar in construction to Tables 
III, IV, and V save that the data are for different forms of the tests 
rather than for different days and that the general summaries based 
upon average medians are placed in parallel horizontal columns of the 
different tables rather than in separate tables. 
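
The "Variation from Mean Score" rows of these tables can be reproduced, at least approximately, by taking the absolute difference between the score on each form and the mean of the 4 forms. The sketch below is illustrative only: the function name is hypothetical, and it is an assumption that the variations were taken from the mean after rounding to one decimal place. The figures are the Grade VIA means on the 4 forms of the Woody-McCall Test as given in Table VI.

```python
def variation_from_mean(form_scores, ndigits=1):
    """Return (mean, absolute deviations from the mean, mean deviation).

    Assumption: the mean is rounded before the deviations are taken,
    as the one-decimal entries of the printed tables suggest.
    """
    mean = round(sum(form_scores) / len(form_scores), ndigits)
    deviations = [round(abs(score - mean), ndigits) for score in form_scores]
    mean_deviation = round(sum(deviations) / len(deviations), ndigits)
    return mean, deviations, mean_deviation


if __name__ == "__main__":
    # Grade VIA, Woody-McCall Test, Forms 1 to 4 (Table VI).
    grade_vi_a = [22.8, 25.4, 23.5, 23.8]
    print(variation_from_mean(grade_vi_a))
```

For these figures the sketch gives a mean of 23.9, variations of 1.1, 1.5, .4, and .1, and a mean variation of .8.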


TABLE VI.—SCORES ACHIEVED ON THE FOUR DIFFERENT FORMS
OF THE WOODY-McCALL TESTS IN ARITHMETIC

                                  FORM 1   FORM 2   FORM 3   FORM 4   MEAN
                                                                      SCORE

GRADE IV A (40 pupils)
  Mean Score                       15.3     15.5     16.1     15.1    15.5
  Variation from Mean Score          .2       .0       .6       .4      .3
GRADE V A (39 pupils)
  Mean Score                       ...      ...      21.9     23.8    22.6
  Variation from Mean Score        ...      ...        .7      1.2      .8
GRADE VI A (44 pupils)
  Mean Score                       22.8     25.4     23.5     23.8    23.9
  Variation from Mean Score         1.1      1.5       .4       .1      .8
ALL GRADES (123 pupils)
  Mean Score                       19.9     21.6     20.6     21.0    20.8
  Variation from Mean Score          .9       .8       .2       .2      .5
  Median Score                     19.8     20.8     20.9     20.8    20.6
  Variation from Median Score        .8       .4       .3       .8      .4


TABLE VII.—SCORES ACHIEVED ON FOUR DIFFERENT FORMS OF
THE COURTIS SUPERVISORY TESTS IN ARITHMETIC
(NUMBER RIGHTS)

                                  FORM 1   FORM 2   FORM 3   FORM 4   MEAN
                                                                      SCORE

GRADE IV A (40 pupils)
  Mean Score                       ...      ...      ...      10.6    11.1
  Variation from Mean Score          .4     ...        .6       .5     ...
GRADE V A (39 pupils)
  Mean Score                       ...      17.9     17.4     17.2    17.2
  Variation from Mean Score        ...        .7       .2       .0     ...
GRADE VI A (44 pupils)
  Mean Score                       16.9     17.8     17.6     17.3    17.4
  Variation from Mean Score          .5       .4       .2       .1      .3
ALL GRADES (123 pupils)
  Mean Score                       14.7     15.7     15.6     15.1    15.3
  Variation from Mean Score          .6       .4       .3       .2      .4
  Median Score                     ...      16.8     16.2     16.3    16.1
  Variation from Median Score      ...        .7       .1       .2      .5

TABLE VIII.—SCORES ACHIEVED ON FOUR DIFFERENT FORMS
OF THE COURTIS RESEARCH TEST IN ARITHMETIC,
SERIES B (NUMBER RIGHTS)

                                  FORM 1   FORM 2   FORM 3   FORM 4   MEAN
                                                                      SCORE

GRADE IV A (40 pupils)
  Mean Score                       ...      ...       5.2      5.3     5.2
  Variation from Mean Score          .3       .0       .0       .1      .1
GRADE V A (39 pupils)
  Mean Score                       10.3     10.1      9.3      9.1     9.7
  Variation from Mean Score          .6       .4       .4       .6      .5
GRADE VI A (44 pupils)
  Mean Score                       11.9     10.1     10.0     10.3    10.6
  Variation from Mean Score         1.3       .5       .6       .3      .7
ALL GRADES (123 pupils)
  Mean Score                        9.1      8.5      8.2      8.3     8.6
  Variation from Mean Score          .5       .1       .4       .2      .3
  Median Score                      8.4      7.2     ...       7.3     ...
  Variation from Median Score      ...      ...      ...      ...      ...

From Table VI, which will be discussed somewhat in detail, altho
it is no more significant than Tables VII and VIII, the following points 
are evident: 

1. In Grade IV, so far as this group of children is concerned, the 
4 forms are approximately equal in difficulty. According to data on 
the latest edition of the Woody-McCall score sheet, the difference in the 
difficulty of the forms is equivalent to 1 month’s progress. 

2. According to the Woody-McCall score sheet, to make the stand- 
ards of achievement comparable with results obtained later than Octo- 
ber, add for each month after that time the following increments: 
Grade III, .54; Grade IV, .43; Grade V, .42; Grade VI, .24; Grade VII, 
.25; Grade VIII, .20. 

3. In Grade V, Forms I and II are approximately equal as are 
Forms II and IV. The variation of the forms from the average is from 
amounts equivalent to 1 month’s improvement to amounts equivalent to 
3 months’ improvement. 

4. In Grade VI, Forms III and IV are equal, but Forms I and II
vary somewhat from the average values. Form I varies from the gen- 
eral average of the 4 forms by an amount equivalent to 5 months’ im- 
provement, and Form II by an amount equivalent to almost 6 months’ 
improvement. 

5. Thus it is seen that 2 forms may be equal in difficulty in one 
group and unequal in difficulty in another group. 

6. On the basis of the average scores for the 123 pupils, the 4 
forms were approximately equal in difficulty; on the basis of the median 
scores, the equality of the different forms is about as perfect as can 
be obtained. The variation in the difficulty based upon the general 
average of the scores for the 123 pupils was in amounts equivalent to
1 to 3 months’ progress; based upon the general median, in amounts less 
than that equivalent to 1 month’s progress. 
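
The conversion of a difference in score into months of progress, as used in the points above, may be sketched as follows. The monthly increments are those quoted from the Woody-McCall score sheet in point 2; the function name is hypothetical, and the simple division shown is an assumption about the procedure, which the text does not state in detail.

```python
# Monthly increments in score, by grade, as quoted from the Woody-McCall score sheet.
MONTHLY_INCREMENT = {
    "III": 0.54,
    "IV": 0.43,
    "V": 0.42,
    "VI": 0.24,
    "VII": 0.25,
    "VIII": 0.20,
}


def months_of_progress(score_difference, grade):
    """Express a difference in score as months of progress for the given grade."""
    return score_difference / MONTHLY_INCREMENT[grade]


if __name__ == "__main__":
    # Grade VI, Form I: variation of 1.1 from the mean of the 4 forms (Table VI).
    print(round(months_of_progress(1.1, "VI"), 1))
    # Grade IV, Form III: variation of .6 from the mean of the 4 forms (Table VI).
    print(round(months_of_progress(0.6, "IV"), 1))
```

On this reckoning the variation of 1.1 for Form I in Grade VI comes to about 4.6 months, which agrees with the 5 months stated above when rounded to the nearest month.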


Tables VII and VIII show similar facts concerning the Courtis 
Supervisory and the Courtis Research Tests. While data are not avail- 
able for expressing the amounts of difference in the difficulty of the 4 
forms in terms of a year’s progress it is ventured on the basis of the 
intervals between the standards for various grades that they vary by 
amounts equivalent to from 1 to 5 months’ progress. 

The 2 most important generalizations to be drawn from the 3 pre- 
ceding tables are as follows: 


1. Forms equated statistically by the utilization of a large number 
of cases may not prove to be equal when applied to a small group. This 
raises the question of the value of such extensive effort to equate statis- 
tically the difficulty of particular tests, or forms of tests, thru great 
masses of data. 

2. In setting up controlled experiments special precautions must 
be taken to guarantee that differences in the difficulty of the forms of 
the tests are not determining factors in the results obtained. It is 
suggested that some form of a revolving schedule similar to the one 
used in this investigation be utilized. If this is not done differences in 
the difficulty of the forms of tests, even tho equated on the basis of
mass-statistics, may give the appearance of learning or lack of learning
and may innocently lead to wrong conclusions. 


The Reliability of the Different Tests. The usual methods for de- 
termining the reliability of a test are either to find the coefficient of 
correlation between the scores obtained on identical tests or different 
forms of the same test given after intervals of a few hours or a few 
days, or to find the correlation between the scores on the odd and even 
questions of a test. The former method is utilized to some extent in 
this investigation, but the main emphasis is based upon the assumption 
that the average of the 4 scores on the different tests is the perfect 
score and that the correlation between the scores on 1 form of a test 
with the average of the scores on all 4 of the forms, i.e., the perfect
score, is a better index than other available measures. These latter
coefficients of correlation computed after the Pearsonian method and 
presented in Table IX are sufficiently high to indicate what is usually 


TABLE IX.—CORRELATION OF THE SCORES ON THE DIFFERENT
TESTS IN ARITHMETIC WITH THE AVERAGE SCORE ON THE
FOUR FORMS AND THE CORRELATION OF THE SCORES
ON THE DIFFERENT DAYS WITH THE AVERAGE
SCORE FOR THE FOUR DAYS

         SCORES ON DIFFERENT FORMS           SCORES ON DIFFERENT DAYS
         WITH AVERAGE SCORE ON 4 FORMS       WITH AVERAGE SCORES ON 4 DAYS

         GRADE    GRADE    GRADE             GRADE    GRADE    GRADE
FORM     IV A     V A      VI A      DAY     IV A     V A      VI A

WOODY-McCALL TEST
1         .91      .78      .86      1st      .87      .59      .86
2         .89      .77      ...      2d       ...      ...      .89
3         .86      .69      .82      3d       .92      .72      .87
4         .88      .64      ...      4th      .93      .84      .84

COURTIS SUPERVISORY TEST
1         .94      .87      .94      1st      .94      ...      .95
2         .92      .78      .94      2d       .94      .87      .91
3         .94      .78      .94      3d       .97      .82      .94
4         .92      .82      .94      4th      .92      .81      .95

COURTIS RESEARCH TEST
1         .93      .83      .92      1st      .90      .84      .90
2         .92      .90      ...      2d       ...      .87      .87
3         .89      .84      .89      3d       .96      .93      .98
4         .92      ...      .84      4th      ...      .88      .92

termed “high reliability”. This means in plain English that there is
a very marked tendency for the pupils to maintain the same relative 
ranks on the basis of the scores on the different forms of the tests as 
on the average of the scores for the 4 forms and for the pupils to hold 
about the same rank on the basis of the score made on each of the 
days as on the average score for the 4 days. Thus the ranks obtained 
on the basis of the different forms of the tests are relatively reliable. 

The reliability of the 3 tests is virtually the same, altho there is 
a slight tendency for the 2 Courtis tests to have slightly higher coef- 
ficients. These findings are rather surprising to the writer since he 
has usually found higher coefficients of reliability for the Woody-McCall 
Tests than for either of the Courtis Tests. In fact the coefficients of 
reliability for the Courtis Tests found in this investigation are from 
.10 to .20 higher than the writer has usually found. In all probability 
higher coefficients found in this investigation are due to the fact that 
the children in Detroit participating in the investigation have been 
tested many times and are thoroly familiar with testing technique. It 
would be interesting to check on this investigation by repeating it in 
schools in which the children are less familiar with testing and testing 
technique. 

Another possible cause for the higher coefficients of reliability in 
this investigation is that all scores for a particular day are included in 
the average of the scores. Some idea of the effect of correlating the 
scores with the average of scores which included the particular scores 
concerned can be derived from the following statement: The coefficient 
of correlation between the scores of Form I of the Woody-McCall Test 
in Grade V and the average of the 4 forms is .78; between the scores 
of Form I and the average of the scores on Forms II, III, and IV, .58. 
Other coefficients will be greatly reduced if the influence of the par- 
ticular element concerned is eliminated from the criterion. 
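
The effect described in the preceding paragraph, namely that a score correlates more highly with an average which includes it than with an average of the remaining scores, can be illustrated with a short sketch. The scores below are invented for illustration and are not data from this investigation; the function and variable names are likewise hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment coefficient of correlation between two lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores of 6 pupils on Forms I to IV of a test (one row per pupil).
scores = [
    [21, 23, 20, 24],
    [18, 17, 19, 18],
    [25, 24, 26, 23],
    [15, 18, 16, 17],
    [22, 20, 23, 21],
    [19, 21, 18, 22],
]

form_one = [row[0] for row in scores]
average_of_all_four = [sum(row) / 4 for row in scores]
average_of_other_three = [sum(row[1:]) / 3 for row in scores]

r_with_self = pearson(form_one, average_of_all_four)
r_without_self = pearson(form_one, average_of_other_three)

if __name__ == "__main__":
    print(round(r_with_self, 2), round(r_without_self, 2))
```

Whatever the particular scores, the first coefficient cannot fall below the second, since Form I contributes its own variation to the average of the 4 forms. The drop from .78 to .58 reported above for Grade V is an instance of the same effect.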

The correlation of the scores achieved in Grade V on one day and 
those achieved on each of the other days are presented in Table X.
These coefficients may be termed the coefficients of reliability calculated 
after one of the first methods mentioned at the beginning of the discussion
of reliability. However, there is a slight variation in the usual
method of determining reliability, viz., not all of the one form of a 
particular test was given on any one day, but a portion of each form 
was given on each of the 4 days. Since all forms of the tests are equal 
in difficulty, this variation should have no influence, but whether it did 
or not is not known. 

The coefficients of reliability as presented in Table X are much 
smaller than those presented in Table IX. This is especially true with 
the Woody-McCall Test. The coefficients between the scores obtained on 
the first day and on the other days are especially low, but the coefficients
between the scores on the other days approximate the size of the coefficients
obtained between the scores on the Courtis Tests on the different
days. The cause of the low coefficients in which the scores of the first
day are involved is not known. But the other coefficients are not any
too high. In fact they are so low that they suggest that if it is desired
to transform the individual scores into terms of grade norms or to calculate
the efficiency or achievement quotients, more than one form of
the tests should be given.


The Relation of Success on One Test to Success on the Other Tests. 
Table XI shows the coefficients of correlation between the scores in 
Grade VA on the different forms of the various tests and between the 
scores on the various tests made on different days. Scrutiny of this 
table reveals the following outstanding facts: 

1. There is much variation in the size of the coefficients of cor- 
relation existing between the scores on the various forms of the 2 tests 
and between the scores made on the 2 tests on different days. The co- 
efficients between the scores on any of the forms of the Courtis Super- 
visory Tests and the Woody-McCall vary from .15 between scores on 
Form IV of the Courtis Supervisory and scores on Form IV of the 
Woody-McCall Test to .61 between scores on Form IV of the Courtis 
Supervisory Tests and scores on Form I of the Woody-McCall Test; 
between the scores on any of the forms of the Courtis Research Tests 
and the Woody-McCall Test from .02 between scores on Form IV of the 
Courtis Research Tests and Form IV of the Woody-McCall Test to .64 
between scores on Form I of the Courtis Research Tests and scores on 
Form I of the Woody-McCall Test; between any of the forms of the 
Courtis Supervisory Tests and the Courtis Research Tests, from .32 
between scores on Form IV of the Courtis Supervisory Tests and scores 
on Form III of the Courtis Research Tests to .65 between scores on 
Form II of the Courtis Supervisory Tests and scores on Form II of 
the Courtis Research Tests. Similar variation exists between the scores 
on 2 tests on different days. This great range in the value of the size 
of the coefficients between scores on forms of the tests which are identi- 
cal or between scores on the same tests on different days should cast 
suspicion on placing great emphasis on a single coefficient or upon a 
very small number of them. 

2. The coefficients of correlation between the average score on the 
4 forms of the test or the average score for the 4 days and the score 
for any 1 form or any 1 day are higher than the coefficients between 
the scores on single forms for any 2 tests or between the scores on any 
2 days. This is to be expected since the average of the scores on the 4 
forms or of the scores of the 4 days is a better score than the score on 
a single form or on a single day and since the average includes the 
scores on a single form or on a single day. 

3. The coefficients of correlation between the 2 Courtis Tests are 
higher than the coefficients between scores on either of the Courtis Tests 
and scores on the Woody-McCall Test. This suggests that the abilities 
measured by the 2 Courtis Tests are closely related, but that these abili- 
ties are different from those measured by the Woody-McCall Tests. This 
corroborates the conclusion reached in the discussion of the amount of 
improvement resulting from the repetition of the different tests in 
which it was pointed out that Courtis Tests are more susceptible to 
gain than the Woody-McCall Tests. 

In general, these facts indicate that the students who are successful 
on 1 test tend to be successful on other tests. This tendency is especially 
strong in case of the 2 Courtis tests. These tendencies are even more 
pronounced when consideration is given to the average scores on forms
of 1 test or the average of the scores for the 4 days and similar averages
for either of the other tests. Table XII indicates that when the
average scores for the 4 days on 1 test are correlated with the average
scores for the 4 days on either of the other tests the coefficients become
much higher than when the averages are correlated with the scores for
a single form or a single day. This condition exists in all probability 
because of the greater reliability of the average scores as measures of 
ability in arithmetic. To make the statement conversely, the correla- 
tions between average scores on forms or days and average scores are 
greater than between average scores and the scores for a single form 


TABLE XII.—COEFFICIENTS OF CORRELATION BETWEEN THE 
AVERAGES OF THE SCORES FOR THE FOUR DAYS ON 
THE DIFFERENT TESTS 








| Woopy-McCatu gy Covurtis SUPERVISORY 
| 


Grave | GRADE | GRADE 
i veh VV wt OS 


' | 


Grape | Grape | Grape 
VIA VA IVA 





Courtis Supervisory... . . 65 55 .70 
Courtis Research 18 65 





or day because of the unreliability of the scores on a single form or on 
a single day as a measure of ability in arithmetic. This statement cor- 
roborates the conclusion reached in the section dealing with the reliabil- 
ity of the tests. 

The Reliability of the Educational Quotient Technique. In order to 
evaluate the effectiveness of the educational quotient technique the edu- 
cational quotients calculated by the method employed by McCall* were 
determined from the scores of the various forms of the different tests 
or from the scores from the results on the different days. Table XIII 
setting forth the educational quotients in Grade V based upon the scores 
on the Woody-McCall and Courtis Research Tests on the different days 
is typical of the results obtained. Other results would have been in- 
cluded but space did not permit. Examination of Table XIII indicates 
that the educational quotients for some pupils are fairly constant for 
the 4 days under consideration, but for other pupils the variation in 
the quotients is surprisingly great. For example, on the Woody-McCall 
Tests the educational quotients for Pupils Nos. 1, 6, 22, 26, 27, 30, 34, 
35, 38, and 39 are fairly constant and would be safe criteria as bases 
for educational procedure, but the educational quotients for pupils Nos. 
5, 17, 20, 23, 31, 32, and 33 are by no means constant. The educational 
quotients for this group of pupils indicate on some of the days that 
the pupils have a high educational efficiency and on other days a low 
educational efficiency. Similar facts are revealed in the examination of 
the quotients based upon the results of the Courtis Research Tests. In 


*McCall, W. A. How to Measure in Education. Macmillan Company, New York,
1922, Chapter II, pp. 19-66. 

TABLE XIII.—EDUCATIONAL QUOTIENTS BASED UPON THE
SCORES IN GRADE V ON THE WOODY-McCALL AND COURTIS
TESTS ON FOUR DIFFERENT DAYS (After McCall)

NUMBER          WOODY-McCALL TESTS              COURTIS RESEARCH TESTS
OF PUPIL     1ST     2D      3D      4TH       1ST     2D      3D      4TH

 1           ...     127     125     ...       ...     ...     ...     ...
 2            89     104     117     100       ...      71      74      71
 3           ...     187     187     118       102     128     118     128
 4            89      89      92     ...        57      71      80     132
 5            97     106     125.6   105.6      78     100     143     143
 6           125     127.5   133     130       130     156     172     130
 7           188     115.9   182     115.9      91     118     118     128
 8           100     102     116     107.6      68      72      80     108
 9           121     137     147     137       128     128     198     198
10            97     108     100     120       138     155     155     155
11           102     116.6   106     116.6     117     130     117     155
12            94     112      96     120.5      72      77     108     108
13           ...     ...     ...     ...       ...     ...     ...     ...
14           113.6   182.5   127     156.8      95     142     128     170
15           ...     128     137     127        91     100     109     156
16            89      92.8   115     100        86     100      92      86
17           ...     122     139     122        83     143     181     143
18           104     102     113     126.5     117     117     181     181
19           104     100     126     104       108     143     117     156
20           107      98     115     115       145     145     145     145
21           ...     ...     111     100        76      87      83      83
22           ...     157     157     157       217     217     217     217
23           ...     ...     ...     ...        92      92     104      85
24           ...      96     104      87       132     167      77      68
25           109     116     121     118        91      86     118     110
26           103      98     100      98       108     100     108     100
27           ...     ...     ...     ...       ...     ...     108     167
28           111     108     118     118       109     109     128     198
29           109     116     103     121        91     102     110     118
30           ...     ...     ...     ...       ...     ...     143     108
31           122     ...     ...      87        69      87     145     145
32           ...     ...     ...     ...        57     144     132     100
33           149      99      97     100        74      78      78     143
34            92     100      91      94        71      68      65      57
35            98     102      98     102        87     117     108     181
36            94      97     106     104        83      87     108     108
37           105     ...     ...     108       ...     ...      91     ...
38           118     120     120     120       121     141     186     186
39            79      91      82      88        71      74      86      71
fact the variation with these tests is even more pronounced than with the
Woody-McCall Tests. With these tests Pupil No. 4 had an educational 
quotient of 57 on the first day and 132 on the fourth day; Pupil No. 5, 
an educational quotient of 78 on the first day and 143 on the third and 
fourth days; Pupil No. 35, an educational quotient of 87 on the first day 
and 181 on the fourth day; Pupil No. 24, an educational quotient of 167 
on the second day and 68 on the fourth day. Many other illustrations 
can be cited, but a sufficient number have been given to establish the
fact that considerable caution should be exercised in the use of educa- 
tional quotient technique. 

On the Woody-McCall Tests, the average deviation of the scores 
on the first day from the average of the scores for the 4 days is 9.2 
points; of scores on the second day, 4.9 points; of scores on the third 
day, 7.4 points; of the scores on the fourth day, 5.5 points. The average 
of these deviations is 6.6 points. On the Courtis Tests the average 
deviation of the scores on the first day from the average of the scores 
for the 4 days is 18.5 points; of the scores on the second day, 12.6 points; 
of the scores on the third day, 13.8 points; of the scores on the fourth 
day, 19.5 points. The average of these deviations is 16.04 points. 
Similar deviations were found in other grades and with the Courtis 
Supervisory Test. 
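
The day-by-day figures above can be illustrated in outline. What follows is a minimal sketch, assuming that "average deviation" means the mean absolute difference between each pupil's quotient on a given day and that pupil's own 4-day average; the function name is illustrative only. The four sample rows are the Courtis Research quotients of Pupils Nos. 4, 5, 24, and 35 as read from Table XIII, not the full class.

```python
def average_deviation_by_day(quotients):
    """Return, for each day, the mean absolute deviation of the pupils'
    quotients on that day from each pupil's own average over all days."""
    n_days = len(quotients[0])
    totals = [0.0] * n_days
    for row in quotients:
        pupil_mean = sum(row) / n_days
        for day, value in enumerate(row):
            totals[day] += abs(value - pupil_mean)
    return [total / len(quotients) for total in totals]


# Courtis Research quotients for Pupils Nos. 4, 5, 24, and 35 (Table XIII).
sample = [
    [57, 71, 80, 132],
    [78, 100, 143, 143],
    [132, 167, 77, 68],
    [87, 117, 108, 181],
]

by_day = average_deviation_by_day(sample)
overall = sum(by_day) / len(by_day)  # average of the four daily deviations
print([round(d, 1) for d in by_day], round(overall, 1))
```

Because these four pupils were singled out in the text as extreme cases, their deviations run well above the class-wide figures quoted above.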

The size of these deviations is great enough to suggest that any 
attempt to base educational procedure upon such unreliable measures 
will be subject to great modification. This does not mean that the edu- 
cational quotient technique is worthless as an educational device, but 
it does mean that much greater effort must be made in the future to 
obtain reliable measures of abilities before the technique of the educational
quotient is applied. When the reliability of the measurements
has been established, then the educational quotient technique becomes 
one of the most useful of the recently developed devices in education. 
But unless considerably more effort is expended in the future than in 
the past in determining the reliability of the original measures it is 
feared that many educational sins will be committed under the guise of 
scientific and progressive education. 


SUMMARY 


Upon the basis of the facts presented in this investigation the fol- 
lowing statements are warranted: 

1. Based on the average scores, the gain from 4 successive repeti- 
tions of the Woody-McCall Tests was 4 per cent; from the Courtis 
Supervisory Tests, 8 per cent; from the Courtis Research, 26 per cent. 
The gains based upon the medians were 2 per cent for the Woody-McCall;
3 per cent for the Courtis Supervisory; and 11 per cent for the Courtis 
Research Tests. 

2. Greater gain from the repetitions of the tests resulted on the 
two Courtis Tests than on the Woody-McCall Tests. 

3. The 4 forms of the different tests were found to be of approxi- 
mately the same general difficulty, altho the amount of variation in the 
difficulty of the forms was more apparent in one grade than in others. 
This suggests that forms equated on the basis of large masses of statistical
data may be unequal when given to small groups of children. In
planning experimental work special precautions should be taken to
eliminate the influence of inequality in the difficulty of the forms.

4. The two Courtis Tests proved to be more reliable in this investigation
than the Woody-McCall Tests, altho none of the tests was
sufficiently reliable to serve as an adequate measure of an individual
pupil.

5. There was a fairly high correlation between the scores obtained 
on the Courtis Supervisory and the Courtis Research Tests, but a rather 
low correlation between the Woody-McCall Tests and either of the 
Courtis Tests. This signifies that the Courtis Tests and the Woody- 
McCall Tests measure different aspects of ability in arithmetic. 

6. The variation in the educational quotients obtained from the 
scores of the Woody-McCall and the Courtis Research Tests was so 
great that it seems folly to apply the educational quotient technique 
until the reliability of the original measures has been more thoroly 
established. 











The Permanent Influence of the Teaching of 
Spelling 


CLIFFORD WOODY, Director of Bureau of Educational Reference and
Research, University of Michigan 


THE GENESIS OF THE EXPERIMENT 


The investigation on the “Permanent Effect of the Teaching of
Spelling” evolved from attempts to explain the situation revealed thru 
various measurements of spelling in numerous cities of Michigan. In 
October, 1921, tests in spelling were given in various cities, and the 
results showed that the general level of spelling efficiency in the cities 
and in the state as a whole was considerably below the Ayres standards
of efficiency. In May, 1922, when other tests were given, the results 
showed much gain, and the Ayres standards were more nearly approxi- 
mated than in the previous October. In October, 1922, a new series 
of tests was given, and the results showed that the existing level of 
efficiency was little better, if any, than that existing in October, 1921. 
Most of the improvement apparent in May, 1922, had vanished. The 
results in May, 1923, again showed marked improvement over those in 
the previous October and the Ayres standards were again almost 
attained. 

In attempting to throw some light on the situation portrayed by 
the results from the 3 previous tests, the tests for May, 1923, were so 
constructed that 10 of the 20 words given each grade were selected 
from the same columns of the Ayres-Buckingham Scale as the October 
lists and 10 of the 20 words—one-half of each list—from the October 
list. Furthermore, at the time of administering the tests each teacher 
was asked to check all words in the list for her grade which she had 
not taught during the current year. 

The results from these tests were interesting but baffling. For ex- 
ample, in Grade III the average results from 17 different cities taking 
the tests in both October and in May showed that the average score 
obtained on the 10 words common to both tests was 78 per cent cor- 
rect; on the 10 words peculiar to the May test, 78 per cent correct. In 
Grade VI the percentage of correct spelling on the 10 words in both 
tests was 82 per cent; on the list peculiar to the May list, 82 per cent. 
Similar results were obtained in other grades, altho it should be said 
that percentages of correct spelling on the two lists in some of the 
grades were not identical. That the scores on the two lists were so 
nearly identical is one of the perplexing problems, for it is evident that 
some emphasis in teaching had been given to the words in the October 
lists, even if no more than that given in the administering of the tests, 
but several of the words peculiar to the May list had received no 
emphasis in teaching if credence can be given to teachers’ checking of 
the words which had not been taught. The number of words in the
total list taught varied in Grade IV from 4 to 17; in Grade V, from 4 
to 18. In Grade IV the score on the words taught was 89 per cent 
correct; on the words not taught, 64 per cent; in Grade V on the words 
taught, 81 per cent correct; on the words not taught, 63 per cent cor- 
rect. One teacher in Grade IV, 3 teachers in Grade V, and 1 teacher 
in Grade VII made higher scores on the words not taught than on the 
words taught. Several teachers who had taught less than half of the 
words achieved considerably higher scores than other teachers who had 
taught considerably more than half of the words. 

Facts of this kind naturally provoked much discussion concern- 
ing the teaching of spelling, and many explanations were offered. It 
was asserted that the situation was caused thru the poor selection of 
the spelling vocabulary of our numerous textbooks. According to this 
assertion the child’s need is the main criterion to be utilized in the 
selection of a spelling vocabulary, and it is asserted that any attempt 
to teach words which are not needed for expressing the child’s thoughts 
in writing will be futile since, if there is no need for spelling them, 
they will be forgotten. From a slightly different angle it was asserted 
that the situation was due to the fact that the nature of the lessons 
assigned and the methods employed are conducive to learning on the 
“cramming level”, i.e., studying the lesson just before the class and re- 
taining knowledge of how to spell the words just long enough to “get 
by” the recitation. From another angle it was asserted that the major- 
ity of the teachers were not really teaching spelling but that they were 
testing spelling or hearing spelling. The proponents of this last asser- 
tion insisted that the teachers should supervise the study of the chil- 
dren and teach them how to study, how to attack new words, how to 
use them, how to review, etc. They maintained that unless proper study
habits were emphasized no permanent results could be expected.


PURPOSE OF THE EXPERIMENT 


In order to provide some evidence on some of these mooted ques- 
tions, the Bureau proposed an investigation for more adequately measur- 
ing the permanent effects of the teaching of certain lists of words. The 
investigation was so planned as to provide some evidence on the fol- 
lowing aspects of the teaching of spelling: 


1. The percentage of words which could be spelled before the 
words had been studied. 

2. The score obtained on the different lists when the teacher felt 
that the children had mastered the words sufficiently well to warrant
the teaching of additional words. 

3. The percentage of the words which could be spelled correctly 
at the end of a month during which there was no review of the words.

4. The percentage of words which could be spelled after the sum- 
mer vacation. 


5. Comparison of the amount of retention on lists of different dif- 
ficulty. 
6. The relationship between scores on the preliminary tests and
the amount lost between the second and third tests.

7. The persistency of certain misspellings.


GENERAL PLAN OF THE INVESTIGATION 


This investigation was conducted in Grades IV, V, and VI during 
the second semester of the 1922-23 school year. All told, 186 teachers 
in 15 different cities in Michigan, and Washington, D.C., and Oklahoma 
City participated in the experiment, altho the facts for only 107 teachers 
have been tabulated at the present time. 

Four different word lists each containing 20 words selected from 
the Horn-Ashbaugh Spelling Scale constituted the spelling vocabulary 
taught during the investigation. These word lists were selected from 
such columns of the scale that the expected standards of achievement
in each grade on the initial testing on Lists I to IV, inclusive,
were as follows: 73, 66, 58, and 50 per cent correctly spelled. The 
words were so selected that List I had about the proper difficulty for 
suitable tests for the respective grades; List II, the proper difficulty for 
suitable tests in grades one-half year in advance of the grades for which they were
designed; List III, the proper difficulty for suitable tests in grades one 
year in advance of the grades for which they were designed; List IV, 
the proper difficulty for suitable tests in grades one year and a half in 
advance of the grades for which they were designed. These lists were 
so selected in order that (1) the words might have sufficient difficulty 
to guarantee that a large proportion of them would have to be taught, 
and (2) the influence of the difficulty of the lists on the permanency of 
the learning might be determined. Thru having the 4 lists with each 
succeeding list representing greater difficulty and becoming farther re- 
moved from the child’s real spelling needs, it was felt that some light 
might be thrown on the assertion that one cannot expect the child to 
remember how to spell words for which he has no need. 

The general directions indicated that the teachers were to test the 
children on the total list of 80 words before teaching any of the words. 
They were directed to keep a complete record of the words spelled cor- 
rectly by each child on this test as an indication of the spelling situ- 
ation at the beginning of the investigation. After this preliminary test- 
ing, which lasted 4 days of the first week of the investigation, the 
teachers were told to teach the 80 words by utilizing their regular 
methods of teaching spelling. The only suggestion given them was that 
in choosing the words to be emphasized on a particular day they should 
select a portion of the words from each of the 4 lists so as to avoid 
having lessons in which all of the words would be especially easy or 
especially difficult. 

Each teacher was directed to teach these 80 words until she felt 
that the children had mastered them well enough to warrant asking for 
new words and then to test on the whole 80 words as she had done in 
the preliminary test. Thus comparison of the results of the preliminary 
test (the first test) and the second test shows the amount of temporary 
gain resulting from her teaching. 
After the second test the teachers were directed to teach other
words in spelling just as they would have taught if this investigation 
had not been undertaken. The teachers were told to disregard utterly 
these 80 words and to omit any of them which might by chance appear in
the regular spelling vocabularies. They were directed to give a third 
test on the original list of words exactly one month after the completion 
of the second test. No warning was given the children before this third 
test, and thus there was no opportunity for reviewing. 

After the third testing, the teachers were directed to proceed with 
their regular teaching of spelling as they had been doing during the 
preceding month. They were instructed as before to omit any of the 
original list of words which might appear in the regular spelling lessons. 
As there was little time between the third testing and the close of school,
the level of efficiency at the third testing may very well represent the 
level of efficiency at the end of the school year. 

In September, after the summer vacation, during the first and second 
weeks of the new school year, as many of the children as could be found 
were tested again with this original list of 80 words. This fourth test 
was given without warning, and again no opportunity was allowed for 
reviewing the words. Several of the teachers were unable to give Test 
IV, and many of the children had moved so that the number of pupils 
taking Test IV was considerably less than the number taking the other 
tests. 

All told, the children in each grade were tested on 4 different word
lists at 4 different times. This means that each child to be considered 
in certain portions of this investigation had to have 16 different records. 
When one considers the amount of labor involved in matching the 16 
records of each individual child he gets some idea of the gravity of the 
task. Out of many classes containing 35 or 40 pupils only 8 or 10 
pupils had records for all of the tests. 
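
The matching task described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (pupil, list, test, score), the function name, and the two sample pupils are hypothetical and are not taken from the Bureau's tabulations.

```python
from collections import defaultdict

LISTS = ("I", "II", "III", "IV")
TESTS = (1, 2, 3, 4)


def pupils_with_complete_records(records):
    """Return the sorted ids of pupils who have a score for every
    combination of word list and testing (4 x 4 = 16 records)."""
    seen = defaultdict(set)
    for pupil, word_list, test, _score in records:
        seen[pupil].add((word_list, test))
    required = {(wl, t) for wl in LISTS for t in TESTS}
    # A pupil qualifies only if the records cover all 16 combinations.
    return sorted(p for p, have in seen.items() if have >= required)


# Hypothetical records: pupil "A" sat every test, pupil "B" missed Test 4.
records = [("A", wl, t, 15) for wl in LISTS for t in TESTS]
records += [("B", wl, t, 12) for wl in LISTS for t in TESTS if t != 4]

print(pupils_with_complete_records(records))
```

A pupil who misses even one of the 16 records drops out of the comparison, which is why so few pupils per class remained.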

The method followed in testing was uniform thruout the investigation.
List I was given on the first day, List II on the second day, List
III on the third day, and List IV on the fourth day. The same list 
occupied the same relative position in all of the testings. 

The exact procedure in conducting the test was as follows: 

a. Have the children prepare their papers for spelling in their 
usual manner. 

b. Pronounce the word distinctly, but not in such a way as to indi- 
cate its spelling. 

c. Use the word in a brief sentence to illustrate its meaning. 

d. Pronounce the word again and direct the children to write. 


RESULTS 


From the Four Testings with the Four Lists. Tables I, II, and III 
show the percentages of the words correctly spelled on each of the 4 
lists at each of the 4 testings for Grades IV, V, and VI, respectively. 
From Table I, which is no more significant than Table II or III, the 
following points are apparent: 


1. There were 661 children in Grade IV in 17 different school 
systems who took all 4 word lists at each of the 4 testings. (Two
schools were in Detroit, but were tabulated separately.) 

2. The number of pupils taking the tests in the different systems 
varied from 5 pupils in Hudson to 195 pupils in Kalamazoo. 

3. On Test I, the range of achievement on List I was from 50 per 
cent correct in the Hunter School of Detroit to 90 per cent correct in 
the school in Washington, D.C.; on List II, from 34 per cent correct in 
the Hunter School of Detroit to 93 per cent correct in the school of 
Washington, D.C.; on List III, from 19 per cent correct in the Hunter 
School in Detroit to 85 per cent correct in the school of Washington, 
D.C.; on List IV, from 9 per cent correct in the Hunter School of Detroit 
to 75 per cent correct in the school of Washington, D.C. It is probably 
a coincidence that the scores on all 4 lists were lowest in the Detroit 
school and highest in the school in Washington. These results must not 
be interpreted as typical of the results which would have been achieved 
in either system as a whole. 

4. On Test I, the standard of achievement for all cities on List I 
was 74 per cent correct, which is 1 per cent higher than the expected 
standard of achievement, but on the other lists there was a gulf between 
the standard of achievement and the expected standard of achievement. 
As is to be expected, the more difficult the list of words the greater the 
gulf. The level of achievement on the different lists of Test I was an 
index of the amount of teaching emphasis which must be exercised 
during the first part of the investigation. It is very interesting to note 
that the children could spell so many words before the active teaching 
was begun. 

5. On Test II, which was given when the teachers felt that the 
children had mastered the 80 words and were ready for additional words, 
the standards of achievement for all cities for each of the 4 lists were 
remarkably high. They varied from 96 per cent correct on List I to 
86 per cent correct on List IV. These high levels of achievement were
beyond doubt the results of teaching, and the high scores attained on 
Lists III and IV represent something of an achievement in teaching. 

6. On Test III, given after a lapse of a month from the time of 
completing Test II, during which interval there was no teaching or 
reviewing of the words, the standards of achievement were surprisingly 
high. On List I the average score for all cities was 91 per cent cor- 
rectly spelled; on List II, 90 per cent correctly spelled; on List III, 82 
per cent correctly spelled; on List IV, 77 per cent correctly spelled. 
This means that the loss during the first month was 5 per cent on 
List I, 4 per cent on List II, 7 per cent on List III, and 9 per cent on 
List IV. While there was slightly larger loss on each succeeding list
of words, the loss was in no way in proportion to the difficulty of the 
words as manifested by the score on the first test. 

7. On Test IV, given after the summer vacation, the level of
achievement was also high. It varied from 85 per cent correct on List I
to 65 per cent correct on List IV. The loss between Test III and Test IV
on List I was 6 per cent; on List II, 8 per cent; on List III, 10 per
cent; and on List IV, 12 per cent. Here again the losses did not correspond
to the difficulty of the words as indicated by the scores on the
first test. 
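
The losses quoted for Grade IV can be checked against the stated scores with a little arithmetic. The sketch below is only an illustration: the Test II and Test IV scores for Lists II and III are not quoted in the text and are derived here from the stated losses, and the variable names are assumptions.

```python
# Average percentages correct in Grade IV on Test III, as stated in the text.
test_iii = {"I": 91, "II": 90, "III": 82, "IV": 77}
# Losses in percentage points, as stated in the text.
loss_ii_to_iii = {"I": 5, "II": 4, "III": 7, "IV": 9}
loss_iii_to_iv = {"I": 6, "II": 8, "III": 10, "IV": 12}

# Scores implied by the stated losses (derived, not quoted for every list).
test_ii = {k: test_iii[k] + loss_ii_to_iii[k] for k in test_iii}
test_iv = {k: test_iii[k] - loss_iii_to_iv[k] for k in test_iii}

for word_list in test_iii:
    total_loss = test_ii[word_list] - test_iv[word_list]
    print(word_list, test_ii[word_list], test_iii[word_list],
          test_iv[word_list], total_loss)
```

The derived end points agree with the figures given in the text: 96 and 86 per cent on Lists I and IV in Test II, and 85 and 65 per cent on the same lists in Test IV.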

Similar facts concerning the scores on each of the lists on the dif- 
ferent tests in Grades V and VI are given in Tables II and III. Scrutiny 
of these tables shows that the standards for all cities approximate the 
same tendencies as manifested in Table I. The amounts of permanency 
on comparable lists in the different grades were so nearly the same that 
it seems almost possible to predict what the percentage of permanency 
between any 2 tests will be. 

To emphasize the similarity of the percentages of permanency 
manifested on the different tests, the results were tabulated in Table IV 
showing the percentages of loss on the 4 lists between the different 
tests in the 3 grades. These results show that the percentages of loss 
in the different grades between Tests II and III ranged from 3 per 


TABLE IV.—PERCENTAGE OF LOSSES ON THE DIFFERENT LISTS 
OF WORDS BETWEEN DIFFERENT TESTS IN GRADES IV, 
cent in Grade VI on List I to 10 per cent on List IV in Grade V; the
percentages of loss between Tests III and IV, from 1 per cent in Grade
V to 14 per cent in Grades V and VI. In general, on the basis of a
possible 100 per cent, the percentages of loss were remarkably small. 
It is true that the percentages of loss on the more difficult lists were 
greater than on the easier lists, but the size of the losses by no means 
corresponded to the differences in the difficulty of the list of words. 
But even so, the only fair interpretation to place on these facts is to 
say that they indicate that the teaching of these particular lists of 
words resulted in a surprisingly great amount of permanency. 

To supplement the evidence just presented, the summary showing 
the scores on the different lists of words made by all pupils in Grades 
IV, V, and VI who took the first 3 tests is included. This summary, 
given in Table V, indicates that 2,790 pupils from 139 classes had com- 
plete records for all 3 tests. That the number of pupils in the different 
grades as shown in this table should be larger than in the previous 
tables is to be expected when it is recalled that Test IV was given after 
the summer vacation during which several of the teachers had resigned 
or had taken positions in other schools, thus greatly augmenting the 
number of pupils who thru other causes did not take all of the tests. 
The facts from this table corroborate the 3 points previously mentioned:
that on the first testing there was much difference in the difficulty
of the word lists, that the level of efficiency on the second testing
was virtually the same on each list of tests, and the loss during the 
interval of a month in which no teaching of the words occurred was 
somewhat greater on the more difficult tests, but on all lists the amount 
of retention was surprisingly great. 

These points would have been even more emphatic if space had per- 
mitted the inclusion of the original tables from which Table V was con- 
structed. These original tables showing the scores made by the indi- 
vidual teachers in each grade on the different lists of words in each of 
the 3 tests indicate clearly that the condition presented in the summary 
is typical of the results achieved by each teacher. Naturally there was 
some variation in the scores achieved by the different teachers in each 


TABLE VI.—DIFFERENCE IN THE CLASS SCORES BEFORE AND 
IMMEDIATELY AFTER THE TEACHING OF THE WORDS OF 
THE VARIOUS LISTS 
testing, but the scores achieved on the second testing indicated that all
had done effective teaching of the word lists, and the scores on the third 
testing indicated the teaching had a relatively high degree of per- 
manency. 

To give some idea of the amount of change which resulted from the 
period of teaching, the distributions of actual changes in the percentage 
of words correctly spelled at the beginning and end of the teaching
period are presented in Table VI. This table indicates clearly that 
some teachers brought about as little as 6 to 8 per cent of change on 
the different lists; other teachers brought about from 50 per cent of 
change on List I to 75 per cent on List IV. The fact that little change 
was brought about does not mean, necessarily, inefficient teaching, but 
may mean that the initial level of efficiency was high. The medians 
and the averages of these distributions indicate that the amount of 
change brought about on the different lists was in proportion to the 
difficulty of the lists of words. This condition was to be expected. 

Figure No. I, which is typical of the scatter-diagrams for the different
lists in the various grades, indicates clearly the relationship between
the scores made on the initial test and the amount of gain made
during the period of teaching. It is clear from this scatter-diagram 
that most of the largest gains were made by those classes which made 
low scores on Test I. Most classes making low scores made large gains, 


TABLE VII.—DIFFERENCE IN THE CLASS SCORES ON THE TESTS
IMMEDIATELY FOLLOWING THE TEACHING OF THE LISTS 
AND IN THE TESTS ONE MONTH LATER 
altho a few made much smaller gains than other classes. The fact that
the vast majority of the classes made large gains when the initial score 
was small is significant for it shows conclusively the possibility of the 
pupil’s ability to learn to spell the words even tho the words were much 
too difficult for the grade under consideration. 

Table VII, exhibiting the distributions of the changes in class scores
between Test II, given immediately following the teaching periods, and Test
III, given after an interval of a month, was obtained by subtracting the
class scores on Test III from those on Test II. A negative number indicates
that a higher score was made on Test III than on Test II; a positive
number, that the score on Test III is lower, as would be naturally
expected.
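
The sign convention of Table VII can be stated compactly. This is a minimal sketch; the function name and the three class scores are hypothetical and serve only to show how positive and negative entries arise.

```python
def change_between_tests(score_test_ii, score_test_iii):
    """Test II score minus Test III score, in percentage points.
    Positive means a loss over the month; negative means a gain."""
    return score_test_ii - score_test_iii


# Hypothetical class scores (per cent correct), for illustration only.
classes = {"Class A": (94, 88), "Class B": (90, 93), "Class C": (85, 85)}

for name, (second, third) in classes.items():
    change = change_between_tests(second, third)
    label = "loss" if change > 0 else "gain" if change < 0 else "no change"
    print(name, change, label)
```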

A perusal of this table reveals 3 interesting facts: 

1. There were 62 classes which on some word list or other made 
higher scores on Test III than on Test II. Whether these higher scores
on Test III were due to general growth during the period, the incidental 
learning of spelling thru the study of other subjects, the unreliability 
of the tests, or secret coaching, is not known. However, it should be 
added that the Bureau has faith that the tests were administered accord- 
ing to directions and that secret coaching during the month was not a 
factor. 

2. The range in the amount of change varied directly with the 
difficulty of the word lists. On List I in each of the grades the range 
was somewhat less than on List II, and in turn the range on List II
was somewhat less than that on List III or on List IV. 

3. The central tendencies (medians and averages) indicate that the
amount of change varied somewhat with the difficulty of the different 
word lists, but not in direct proportion to the amount of difficulty. On 
List I the amount of change in the different grades represented a loss 
of from 2.3 to 3.8 per cent; on List II, a loss of from 3.0 to 5.0 per 
cent; on List III, a loss of from 3.6 to 7.4 per cent; on List IV, a loss 
of from 6.5 to 10.0 per cent. When comparison of the amount of change 
resulting from the period of instruction was made with the amount of 
loss made during the month’s interval in which there was no teaching, 
the only conclusion reached was that the results of the teaching were 
relatively permanent. 

Figure II is a scatter-diagram which shows the original scores for 
the classes in Grade IV on List I, Test I, in relation to the gains or 
losses between the scores on Test II and Test III. Figure III shows 
similar results for the classes in Grade VI on List IV. Both Figures 
II and III are included because the former represents the scores on the 
least difficult list of words which probably required little effort to master, 
and the latter represents the scores on the most difficult list of words 
which required much effort to master. Similar diagrams were made for 
all word lists and for all grades, but Figures II and III are typical and 
space will not permit including others. 

It is evident from these figures that there is little relationship be- 
tween the initial scores made on Test I and the amount of gain or loss 
made during the interval between the giving of Test II and Test III. 
This is somewhat surprising as it seems reasonable to suppose that the 
loss on a particular list of words would be greater in those classes
making low scores on Test I. It would seem that if some classes had 
many more words to learn during the period in which the teaching was 
done they would have many more words to remember and thus more 
chances for losses. However, the horizontal reading of any array of 
losses indicates the same loss for various scores on Test I, and the per- 
pendicular reading of any column indicates that a particular score on 
Test I is accompanied by various amounts of gain or loss. 

These conclusions are true, no matter whether dealing with data 
from easy or difficult lists of words. They point to but one conclusion, 
viz., that the learning how to spell the words under consideration was 
permanent. It should furthermore be pointed out that, while there were 
greater losses on the more difficult lists, the learning on these lists was 
relatively permanent. These conclusions bear little comfort for those 
who claim that teaching children how to spell difficult words results in 
total loss because the children having no use for these words will forget 
how to spell them. Likewise these conclusions bear little solace for those 
who assert that children learn how to spell on the “cramming level”, 
i.e., learning to spell the words just long enough to “get by” the spelling 
recitation and then feeling free to forget them. 

However, these generalizations must not be applied too readily to 
the teaching of spelling in general for it is possible that the conditions 
in this investigation were not typical of all teaching of spelling. First 
of all, it is probable that the best and most progressive teachers of 
spelling took part in this investigation, altho there was no effort to 
make such a selection. In the second place, experience has shown that 
better results are obtained when teachers are participating in investi- 
gations in which the results are frequently measured. In the third 
place, there is reason to believe that the teachers participating in this
investigation taught on a much higher plane than usual. A few teachers 
reported that in teaching the most difficult of the words the usual 
methods of teaching failed and that special effort had to be put forth 
and new methods of teaching had to be utilized. One teacher reported 
that she had to give up the regular class spelling recitation and spend 
her time in coaching individual pupils or in directing the better pupils 
as they coached the poorer ones. This teacher reported that the pupils 
thought the task of learning these words was very painful when com- 
pared to the learning of the words given in their regular speller. 
Regardless of whether the conditions of this investigation were 
typical of the conditions under which spelling is usually taught, it 
seems safe to suggest that failure to get permanent results from the 
teaching probably indicates poor methods of teaching. It would be a 
very interesting study to determine which methods of teaching spelling
secure the greatest permanency of results, but this particular aspect was 
not emphasized in this investigation. 

Another aspect of the permanency of the teaching of spelling is the 
persistence of certain errors in the spelling of the words. Since each 
child considered in Tables II, III, and IV spelled each word of the 4 
lists on the 4 different tests it would have been possible to compare the 
4 responses had all cities sent in the original test papers. The original 
test papers were returned from Kalamazoo, and from these data a tabu- 
lation of persistence of certain types of errors was made. This tabula- 
tion involved only a few pupils inasmuch as some of the papers were 
apparently scattered or misplaced and it was impossible to obtain 4 
different responses for each child. 

The complete records of two pupils in Grade V on List I are in- 
cluded in order to make clear the nature of the task involved. As will 
be noted, the 2 pupils selected were poor spellers, but their responses 
illustrate many aspects of the permanent influence of learning. 

The first columns of Tables VIII and IX show the words of List I; 
the blank spaces in the column under the enumerated testings indicate 
that the correct responses were given to the stimulus word; the given 
misspellings under each of the enumerated testings indicate the re- 
sponses given the specified word on each particular test. At times it 
was difficult to determine from a child’s writing the exact response he 
had made, but in all cases special care was taken to reproduce his 
response. 


TABLE VIII.—COMPLETE RECORD OF RESPONSES GIVEN BY PUPIL
NO. 1 TO THE WORDS OF LIST I ON EACH OF THE FOUR TESTS

Word        Test I        Test II       Test III      Test IV
soldier     soldger       ....          solder        solager
barrel      barell        ....          barel         barerl
against     ....          ....          ....          aganst
break       brack         ....          brok          brake
continue    contenue      contenue      contenue      countinyou
collar      collor        ....          ....          ....
delayed     delade        delayd        deladlade     delade
crippled    crokola       creppeld      eripild       ciripeld
gotten      [illegible]   ....          ....          cotten
gasoline    gasolean      ....          gasolean      casalen
husband     hosbon        hesbon        husbon        husbon
headache    hadack        hadack        hadack        hadack
outlined    [illegible]   ....          ....          Outlindind
machine     [illegible]   ....          mashen        mashen
plainly     planley       planly        planly        planly
measure     macure        mashure       mase          masher
shock       shot          ....          spik          shop
stories     stores        stoers        storyes       storey
skirt       schert        skert         sheart        schert
uniform     unuform       unaform       unaform       yonafrom



















From Table VIII the following facts are evident concerning the 
spelling ability of Pupil No. 1: 

1. On Test I he misspelled 19 of the 20 words given; on Test
II, 10 words; on Test III, 16 words; and on Test IV, 19 words. It is 
noted that on Test II he spelled the following words correctly: “soldier”, 
“barrel”, “against”, “break”, “collar”, “gotten”, “gasoline”, “outlined”, 
“machine”, “shock”. On Test III, he spelled correctly but 4 words: 
“against”, “collar”, “gotten”, “outlined”; on Test IV, he missed all 
words but “collar”. It is interesting that he misspelled the word 
“against” on Test IV when he had spelled it correctly before it had 
been taught and on each of the 2 succeeding tests. 

2. There was some tendency for certain errors of spelling to per- 
sist, but this tendency was not universal. The word “continue” was 
spelled “contenue” on Test I, Test II, and Test III. The word “gasoline” 
was spelled “gasolean” before the words were taught; it was correctly 
spelled on Test II but on Test III, after a lapse of a month in which 
there was no teaching, it was spelled exactly as on Test I. On Test IV, 
it was misspelled quite differently from either of the other misspellings, 
which fact suggests a misunderstanding of the pronunciation of the 
word. There was some variation in the misspelling of the first syllable 
of the word “husband” but the last syllable was always “bon”. The 
word “headache” was spelled “hadack” on all 4 tests. “Plainly” was 
spelled “planly” on Tests II, III, and IV. “Skirt” was spelled “schert”
on Test I, “shert” on Test II, “sheart” on Test III, and “schert” on 
Test IV. “Uniform” was misspelled on each of the 4 tests, but certain 
errors tend to be manifested in each misspelling. On other words of 
the lists the responses did not have constant errors. 

3. There was some evidence that this individual tried to spell
the words as they sounded. In spelling “soldier” the tendency was to
spell the last syllable “ger”; the “ue” sound of “continue” and the “un”
sound of “uniform” at times were spelled “you”; the “band” of
“husband”, “bon”. Other illustrations are evident in the list, but sufficient
examples have been given to establish the point. 

4. Numerous errors occurred which apparently bore little relation- 
ship to the words or to anything connected with them. The words may 
have been misunderstood or the individual may have made random 
errors. Some words were misspelled thru the doubling or lack of 
doubling certain consonants. There were other errors which were easily 
recognized but they will not be mentioned, as the chief interest at 
present is the persistence rather than classification of error. 

Table IX showing responses given by Pupil No. 2 merely
corroborates the conclusions cited for Pupil No. 1. This individual
misspelled 18 of the 20 words on Test I; missed but 4 words on Test II;
missed 16 words on Test III; and missed 17 words on Test IV. The
teaching given during the interval between Tests I and II was tempo- 
rarily fairly effective, but its permanent effect was negligible. There 
is some evidence of the persistence of error in the case of this indi- 
vidual, e.g., “hushand” for “husband”, “solder” for “soldier”, but the 
tendency is not so strong as in the case of Pupil No. 1. There is much
evidence that Pupil No. 2 tried to spell the words according to the
sound, e.g., “outlind” for “outlined”; “youne” for the “uni” of
“uniform”; “brall” for “barrel”; “gotton” for “gotten”, etc. There is
much evidence of random error due either to inaccurate hearing of the
words, inaccurate associations, or random guessing.

TABLE IX.—COMPLETE RECORD OF THE RESPONSES GIVEN BY
PUPIL NO. 2 TO THE WORDS OF LIST I ON EACH OF THE FOUR TESTS

Word        Test I        Test II       Test III      Test IV
soldier     solder        [illegible]   [illegible]   ....
barrel      ....          [illegible]   [illegible]   brall
against     agen’t        agianst       aggent        agent
break       bright        brack         [illegible]   ....
continue    content       [illegible]   contain       continot
collar      corral        [illegible]   [illegible]   ....
delayed     dealy         [illegible]   [illegible]   daly
crippled    cruple        ....          cupped        cripling
gotten      goen          ....          gotton        gotgone
gasoline    gasellen      [illegible]   [illegible]   gasolline
husband     humber        [illegible]   [illegible]   hushand
headache    ....          [illegible]   [illegible]   headach
outlined    outlind       [illegible]   [illegible]   outlind
machine     meachen       ....          mechine       michuse
plainly     plandley      [illegible]   [illegible]   blandle
measure     meauner       mauser        musure        mearsure
shock       shald         [illegible]   [illegible]   shoke
stories     storys        [illegible]   [illegible]   story
skirt       suretry       [illegible]   [illegible]   scuret
uniform     wrenumform    ....          unfuren       youneyfor

It would be enlightening to present individual reports for other
pupils, but space will not permit. A general summary for a group of 
pupils in Grades IV and V is presented in Table X. From the first 
horizontal line of figures in this table it is evident that the influence of 
teaching was permanent on 19 to 38 per cent of the words. It should 
be added that the percentage of permanency was as great on Lists III 
and IV as on Lists I and II. The second horizontal line shows that 
from one-fourth to almost two-thirds of the words misspelled on 
Test I were misspelled on either Test II, III, or IV. The third horizontal 
line of figures indicates that from 6 to 18 per cent of the words
misspelled on Test I were taught well enough to be spelled correctly on
Tests II and III, but not well enough to be spelled correctly after the
summer vacation; the fourth line, that from 6 to 19 per cent of the
words missed upon Test I were taught well enough to be spelled
correctly on Test II but not sufficiently well to be spelled correctly on
Tests III and IV. Thus the second, third, and fourth horizontal lines 
of figures suggest the lack of the permanent effect of teaching. 

The last 3 columns indicate the persistency of certain errors. From 
1 to 2 per cent of the words missed on List I were spelled correctly on
Tests II and III, but were misspelled on Test IV exactly as misspelled 
on Test I; from 2 to 6 per cent of the words were misspelled in exactly 
the same way at all times when misspelled; from 6 to 14 per cent of 
the words were not misspelled exactly in the same way, but there was 
a very strong tendency for them to be misspelled in the same way. 
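
The first two of these categories can be stated as a simple rule, and a short sketch may make the distinction concrete. It is an illustration only and not the tabulating procedure used in the investigation: the function name, the labels, and the use of None for a correct spelling are assumptions made here, and the sample responses are a few of those recorded for Pupil No. 1. The third category (a "very strong tendency") rests on judgment and is not attempted.

```python
from typing import Optional, Sequence

# Responses on Tests I, II, III, and IV; None marks a correct spelling.
# The words and misspellings are taken from the record of Pupil No. 1.
RESPONSES = {
    "headache": ["hadack", "hadack", "hadack", "hadack"],
    "continue": ["contenue", "contenue", "contenue", "countinyou"],
    "gasoline": ["gasolean", None, "gasolean", "casalen"],
    "collar": ["collor", None, None, None],
}


def classify(responses: Sequence[Optional[str]]) -> str:
    """Label one word's four responses (labels are hypothetical)."""
    first, second, third, fourth = responses
    errors = [r for r in responses if r is not None]
    if not errors:
        return "never misspelled"
    # Correct on Tests II and III, then misspelled on Test IV as on Test I.
    if first is not None and second is None and third is None and fourth == first:
        return "relapsed to original error"
    # Misspelled in exactly the same way every time it was misspelled.
    if len(errors) >= 2 and len(set(errors)) == 1:
        return "identical error throughout"
    if len(errors) == 1:
        return "single error"
    return "varying errors"


for word, record in RESPONSES.items():
    print(word, "->", classify(record))
```

Under this rule "headache" falls in the identical-error class, while "continue" does not, because the Test IV response differs from the three earlier ones.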

These last columns indicate that in from 8 to 20 per cent of the 
words the errors made in spelling tend to persist. It seems then that 
errors persist in spite of the influence of teaching. The fact that they 
do may be a warning that the practice of having the children attempt
spelling words before studying them needs modification so that the first 
impression of the spelling of an unknown word will be correct. The con- 
dition at least suggests the need for a scientific evaluation of the practice. 

RESULTS 

The outstanding conclusions from this investigation are as follows: 

1. From 50 to 75 per cent of the words on Lists I and II and from
30 to 47 per cent of the words on Lists III and IV were spelled
correctly before active teaching was begun. This suggests that unless the
words are pronounced to children before their study period, much
effort may be wasted in the indiscriminate attacking of words.

2. The period devoted to teaching was very effective as the per- 
centage of correctly spelled words on Test II was on the average above 
90 per cent. When the difficulty of some of the word lists is considered, 
this high level of efficiency is surprising. 

3. The level of efficiency on Test II bore little relation to the level 
of efficiency on Test I. The classes which had many words to learn 
achieved approximately the same scores on Test II as the classes which 
had few words to learn. The fact that the teachers were allowed to 
emphasize the words until they were mastered well enough to warrant
the teaching of new words may be responsible for this condition. 

4. The permanent effect of the teaching of these words was rather 
surprising. On List I, the loss between Tests II and III was from 3 to 5 
per cent, and between Tests III and IV from 1 to 6 per cent; on List II, 
between Tests II and III from 4 to 5 per cent, and between Tests III 
and IV from 6 to 10 per cent; on List III, between Tests II and III 
from 6 to 8 per cent, and between Tests III and IV from 8 to 11 per 
cent; on List IV, between Tests II and III from 6 to 10 per cent, and 
between Tests III and IV, from 12 to 14 per cent. 

5. There was little relationship between the difficulty of the words 
and the degree of permanency manifested. While the actual amounts 
of loss on Lists III and IV were greater than on Lists I and II, the dif- 
ferences in the amounts of loss were not in proportion to the differences 
in the difficulty of the word lists. The children actually retained to a 
surprising degree the knowledge of how to spell the words of List IV, 
words normally given to classes one year and a half in advance of these 
classes. As it seems safe to conjecture that the children of these grades 
had little opportunity to use these words in expressing their thought 
in writing, this condition rather routs the contention that children will
forget how to spell words for which they have no use. 

6. There was a definite tendency for certain errors to persist re- 
gardless of the influence of teaching. From 2 to 6 per cent of the words 
misspelled showed identical mistakes and from 6 to 14 per cent of the
words, almost identical mistakes. While the percentages of persistent
errors were much smaller than the percentages indicating the permanent 
effects of the teaching, they were sufficiently large to suggest that the
testing-before-study method of teaching spelling needs scientific
justification. At any rate these percentages of persistent errors
suggest that it may be bad practice to ask a child to spell a word which 
he knows he cannot spell. Would it not be better in the testing-before- 
study method first to have the child indicate the words which he knows 
or thinks he knows and test on these and not test on the other words 
until after a study? 

7. The conclusions stated above were derived from a sufficiently
large number of teachers and pupils to be reliable, but it is not asserted
that the conclusions will be found in everyday teaching. In all
probability the teachers who volunteered for the investigation were superior
teachers and in all probability more permanent results were derived 
because they were participating in an investigation where frequent 
evaluations were being made. However, if the teachers who partici- 
pated in this investigation could obtain permanent effects from their 
teaching of spelling it seems reasonable to suppose that other teachers 
can also, if proper effort is put forth. 

8. In this investigation a few teachers obtained more permanent 
effects than others, but no attempt was made to ascertain the factors 
underlying the greater permanency. It is suggested that a future
investigation emphasizing the influence of various methods of teaching on the
permanency of learning would be most worth-while.














Individual Development as Shown by Repeated 
Measurements 


WALTER F. DEARBORN, Psycho-Educational Clinic, Graduate School of 
Education, Harvard University 


WITH the aid of a generous subvention of the Commonwealth Fund 
and the zealous coöperation of the staff of the Psycho-Educational Clinic
and students of the Harvard Graduate School of Education, the writer 
inaugurated in the fall of 1922 a comprehensive investigation of the 
mental and physical development of school children. During the course
of that academic year, over 5,000 school children, most of them in the
first and second school grades, were given an extensive series of
measurements with the intention of repeating these measurements annually on
the same children thruout the period of growth, or as long as they 
remain in school. Two group tests of intelligence and tests of school 
attainments were given to all, and over 1,000 children were examined 
individually with the Stanford-Binet. The physical measures included 
measurements of height and weight, of bodily proportions, dentition, and 
ossification as shown by X-ray photographs of the carpal bones. These 
measurements were supplemented by records of teachers’ judgments in 
regard to intelligence, behavior, and other pertinent characteristics of 
these children, and by information gathered in regard to sex, race, 
health, and home conditions. The investigation is now in its second 
year. I wish to set before you the reasons or justification for such an 
elaborate study, some of the problems in view, and to review the results 
of the first year’s work. 

The need for repeated observations of the mental and physical de- 
velopment of school children during the period of growth has long been 
recognized. The writer’s first interest in the problem began with a 
study made fifteen years ago which showed the comparative constancy 
maintained in the relative standing of pupils in their progress thru 
school as judged by school marks.* The majority of individuals studied 
were shown to have maintained approximately the same rank in the 
grade school, high school, and college; subsequently it has been shown 
that relative standing in college is a good index of relative standing in 
the professional schools. 

Repeated measurements of physical growth have similarly disclosed 
facts which previous comparisons in terms of the averages of different 
age groups had missed. The extent to which individuals maintain 
thruout the period of growth their initial superiority or inferiority in 
physical stature, weight, etc., has been graphically shown by the studies 
of Kammerer, Wissler, and Baldwin. 

The most significant result from the repeated application of mental
tests to the same individuals has been the demonstration of the relative
constancy of the intelligence quotient under similar conditions of
environment and training.

* Bulletin of the University of Wisconsin, No. 312, High School Series No. 6, August, 1909.

These studies would seem to indicate that the mental and physical 
development of the majority of individuals was gradual and regular. 
The early study of Wissler, however, of repeated measurements of 
height and weight showed that children who had been growing more 
slowly than the average up to the adolescent period might in a few 
years of rapid growth pass well above the average, and, contrariwise, 
children who had been growing rapidly up to this period might so slow 
down in the increments of growth as to fall below the average. 

The recent and exceedingly important study by Porter of the growth 
of Boston school children (based on ten years of repeated measure- 
ments) shows the extent of such variations in the increments in height 
and weight between the ages of five and thirteen inclusive.* 

The repeated tests of mental growth have not been carried into the 
adolescent period in sufficient numbers to warrant conclusions, but it is 
quite likely that in many cases there will be a similar lack of constancy 
in the I.Q. at this level, that is, children whose mental development in
the early years has been slow may make up for this retardation by 
later acceleration. This is certainly true in the exceptional cases among 
whom are, of course, many of our problems. Our experience in this 
respect with a limited number of cases is supplemented by the experi- 
ence of Cyril Burt in the London public schools. In speaking of the 
occasional improvement which he has noted in the mental ratios or intel- 
ligence quotients of special class children, he makes the following inter- 
esting observations: 

“In every case, with one dubious exception, the subsequent history 
unequivocally suggests that the partial restoration must be connected 
with some deeper cause than mere accident or freak of fortune. That 
cause appears to be an intrinsic irregularity of mental growth. Such 
children are creatures of deferred maturity. Their development is not
arrested; it has been postponed. Although upon a lower plane, their 
mental growth runs parallel with that of many cleverer children, in 
whom the phenomenon is more familiar. There is many a sharp child 
whose cycle of growth is like that of the mulberry tree, presenting first 
a long delay, and then a sudden yield of flower and fruit together. 
Their existence is recognized in the double scholarship examination. In 
London at the age of thirteen a second examination has been instituted 
specifically for those who in the current phrase ‘bloom late’, and whose 
anticipated powers, therefore, do not ripen by the age of ten. In like 
fashion, among the classes for defectives, time and due season will here 
and there disclose a sporadic ‘school autumnal’.” 

It is important to determine how general this phenomenon is and 
how it can be recognized in a given case; that is, are there any ways 
by which we can determine whether a backwardness in development in 
the early years is only a temporary matter or whether it is indicative 
of a permanent lack which will not be overcome? It is in the hope of
finding such differential criteria that we believe it to be desirable to
have physical and physiological tests repeated on the same individual
on whom we are making the repeated mental and scholastic tests.

* Porter, W. T. “The Relative Growth of Individual Boston School Boys,” American
Journal of Physiology, Vol. LXI, No. 2, July, 1922.

The relation of physical and mental development has long been a 
mooted question. We know that in the case of physical development 
taken by itself there are alternations in the time of acceleration and 
retardation in the growth of different parts of the body, and it has been 
believed by many that there are similar alternations between physical 
and mental growth: that during periods of rapid physical growth, there 
is what Superintendent Greenwood of the Kansas City schools used 
twenty-five years or more ago to call a “superabundance of inertia” 
from which school work will suffer. The question at issue has never 
been satisfactorily answered. Most of the studies of physical develop- 
ment have depended on but a single measure, such as height or weight. 
It is our opinion that an average of a number of different measures, 
such as we are applying, may give us a better index of general physi- 
cal development. This has been the experience in the field of mental 
tests: single tests were inconclusive; whereas the combining or
averaging of several tests has given a useful measure of the general
intellectual development of the individual. Our opinion, which may or may
not be substantiated in the course of this investigation, is that such a
measure of general physiological development taken together with our 
present measures of general intellectual development may enable us to 
make the necessary analysis and differentiation between the various 
factors affecting development. The mental age as now secured is the
result of several factors: native intelligence, relative physiological
maturity, physical health and vigor, and environment, the last including
special training or practice. The analysis which we have in mind to 
make between these factors may be illustrated by short quotations from 
published statements of the speaker and the description of two or three 
cases upon whom some of the above-described observations and measure- 
ments have been made. 

“A child somewhat backward in mental development, whose yearly 
increments in mental age have been small and who on repeated examina- 
tion proves to be correspondingly backward in general physiological 
development, may frequently make up for his slow start before he 
reaches maturity. The prognosis in his case would be better than in 
the case of a child of the same early mental level but who, at the same 
time, is found to be physiologically well along in the course of develop- 
ment. The fact that he has come on so well in general physical develop- 
ment in the early years without corresponding mental growth would 
make his prospects less hopeful. Similarly, some of our much-heralded 
prodigies, who have rather petered out in later years, may prove to 
have maintained their relative superiority for a few years because of 
early maturing, supplemented by a kind of hot-housing.” 

“Among the cases observed in which slow or average development 
in the early years has been followed by rapid physical and mental de- 
velopment later on in the period of growth is that of a girl who at the 
age of 9 years and 11 months had a mental age, as determined by the 
intelligence tests, of 9 years and 10 months, and an intelligence quotient 
of 99. Three years later she attained in the tests a mental age of 17
years and 8 months and an intelligence quotient of 136,—a gain of 37 
points. At the time of the first examination she was at about the 68th 
percentile in physical height of girls of her own age; at the end of the 
three years she was at the 87th percentile. The acceleration in mental 
development was thus accompanied by a somewhat corresponding ac- 
celeration in physical growth. 

“A girl who entered school at the age of 6 had, as a result of 
systematic instruction which her parents had begun when she was 3 
years old, covered the regular work of the first four or five grades of 
school. She secured on examination with the general intelligence tests 
a mental age of between 11 and 12 years and an intelligence quotient 
well over 160. In a series of performance tests for which this previous 
coaching had not prepared her, she did but little better than children 
of her own age. Although physically weak, of slender build, and fre- 
quently ill, there was some evidence of a certain physiological develop- 
ment in advance of her age which may have been in part the result of 
the intensive training and hot-housing to which she had been subjected. 
Her present superiority in the mental and scholastic tests would appear 
to be due in good part to these factors. If this is the case, the following 
are some of the possibilities in her subsequent development: (1) she 
may continue her unbalanced development with a resulting freakish 
intellect or genius within very narrow limits; (2) the physiological 
changes of adolescence may be completely unsettling with a nervous 
break-down and the development of psychopathic traits; (3) the demands 
of general somatic development may become such that the initial ac- 
celeration in development proves purely temporary and the child settles 
back to the general level of mediocrity. 

“A different result is indicated in the case of a boy first examined 
in 1918 at the age of 5 years and 9 months. In four successive annual 
examinations his intelligence quotient has closely approximated 100. 
His parents are both of exceptional abilities, and, because of the child’s 
general health and some suspicions of defective heart action, they have 
let nature take its course in the child’s development. His present mental 
status is, it is believed, due chiefly to his native intelligence. Physical 
and physiological measurements and indices, including dentition and os- 
sification, indicate slow development—a condition which may by the time 
of the pubertal acceleration lead to his passing well above the general 
average of his age. The prognosis is one which repeated measurements 
can alone at present test.” 
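
The intelligence quotients quoted in these cases are the ratio of mental age to chronological age, multiplied by 100. As a minimal sketch (the function names are illustrative and do not come from the original study), the first girl's quotient at 9 years and 11 months, and the lower bound of the coached girl's quotient at age 6, can be checked as follows:

```python
def months(years: int, extra_months: int = 0) -> int:
    """Convert an age given in years and months to months."""
    return 12 * years + extra_months


def intelligence_quotient(mental_age: int, chronological_age: int) -> float:
    """Ratio I.Q.: mental age divided by chronological age, times 100.

    Both ages must be in the same unit (months here).
    """
    return 100 * mental_age / chronological_age


# Girl aged 9 years 11 months with a mental age of 9 years 10 months.
first_case = intelligence_quotient(months(9, 10), months(9, 11))
print(round(first_case))  # 99, as reported in the text

# Girl aged 6 with a mental age of at least 11 years.
second_case = intelligence_quotient(months(11), months(6))
print(second_case > 160)  # True, in keeping with "well over 160"
```

The exact age at the later examination of the first girl is not stated, so her second quotient is not recomputed here.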

The results of our present studies of ossification also point to the 
accuracy of the above method of analysis. Altho this is but a single 
measure of physical growth, it appears to be one of the best and may 
prove to correlate well with the average of measures just as, for
example, the tests of vocabulary in the Stanford-Binet tests have been
shown to correlate well with the results of the combination of tests
comprising the scale as a whole.

I may speak first of some of the general findings: 


“X-ray photographs of the ossification of the carpal bones have been 
made in all grades and ages of children from the first through the high 
school. An objective method has been established by Dr. D. S. Prescott,
research assistant in the investigation, for measuring the individual dif- 
ferences and the changes observed in the various age groups. These 
differences are expressed in terms of so-called anatomic indices. The 
results of the first 3,000 cases have been presented in a recently pub- 
lished ‘Monograph on the Determination of Anatomical Age in School 
Children and its Relation to Mental Development’ (Harvard Monographs 
in Education, Series I, No. 5). 

“A wide range of variation is found in children of the same chrono- 
logical age. Boys of 6 years, 3 months, to 6 years, 8 months, inclusive 
have ratios varying from .5 to 2.20, or, expressed in terms of
anatomical ages, from 4 years to 9½ years. Boys thus differing but 6 months
in chronological age differ by as much as 5½ years in anatomical
development. These ratios are distributed symmetrically about their mean
for a given age and have the same range of variability which has been 
found in the mental ages of children. As has been shown to be the 
case of mental ages, this variability doubtless increases in direct pro- 
portion to the chronological age.” 


The relations between physiological development, as gauged by this 
anatomical index, and mental development have been studied in the 
above-cited monograph by means of simple and partial coefficients of cor- 
relation. The results are open to different interpretations. A more 
direct method of comparison, and one less open to difference in inter- 
pretation, is to state the relationships in terms of probability, that is 
to show what the chances are of a relationship holding. This, Dr. 
Prescott, in collaboration with the speaker, has done in a recent study 
which will shortly be published and from which I may cite the following 
results:


“In an unselected group of 757 children, comprising all the first 
grade children in a single school system, it was found that the intelli- 
gence quotients, as determined by individual tests, varied from .50 to 
1.70 and were distributed roughly according to the probability curve. 
Of this group 387 individuals, or 51 per cent, had anatomic indices above 
the medians normal to their ages and sexes, and 370 individuals, or 49 
per cent, had anatomic indices below the proper medians. Thus the 
proportions above and below were about even.” 

“If there were no relationship between anatomical and mental de- 
velopment, as some of the partial coefficients of correlation seemed to 
show, then these same proportions ought to hold for the various groups 
making up this composite or unselected group. That is, roughly 50 
per cent of the individuals with I.Q.’s below 80 ought to be found both 
above and below the medians proper to their ages and sexes. Also the 
same proportion of individuals with I.Q.’s above 110 or above 120 ought
to be found above and below the medians. But examination does not 
show this to be a fact. Of the 55 individuals below 80 in I.Q., only 16, 
or 29 per cent, had anatomic indices above the medians proper for their 
ages and sexes, while 71 per cent had anatomic indices below the median. 
On the other hand, of the 110 individuals having I.Q.’s above 110, 70, 
or 64 per cent, had anatomic indices above the medians of their ages 
and sexes while only 40, or 36 per cent, had anatomic indices below the 
median.” 

“Further support of these general findings is given by an analysis 
of the anatomic indices of the 110 children with I.Q.’s above 1.10. Of 
these, 76 had intelligence quotients between 1.10 and 1.20. Of this num- 
ber 46, or 61 per cent, had anatomic indices which were above the norm, 
and 30, or 39 per cent, had anatomic indices which were below the norm. 
Of the 34 individuals who had intelligence quotients of 1.20 or above, 
24, or 71 per cent, had anatomic indices above the norm, while only 10, 
or 29 per cent, had anatomic indices below the norm. This tends to 
show that the greater the deviation from the normal in mental develop- 
ment the greater the chances of deviation in anatomical development. 

“Evidence of the same sort is found in the study of 349 feeble- 
minded children from the Massachusetts School for Feeble-minded at 
Waverly. Of this number 235, or 67 per cent, had anatomic indices 
below the medians normal to their ages and sexes, while only 114, or 
33 per cent, had anatomic indices above the norm. Thus, for the feeble- 
minded, the chances are 2 to 1 that an individual will be below normal 
in anatomical development as well as in mental development.” 
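The percentages and chances quoted in the preceding paragraphs can be checked directly from the counts given. The following is a minimal sketch in Python and is not part of the original study; the function name `summarize` and the group labels are chosen here for illustration only, and the counts are simply those stated in the text.

```python
# Counts quoted in the text:
# (group, number of children, number with anatomic index above the median)
GROUPS = [
    ("unselected first grade", 757, 387),
    ("I.Q. below 80", 55, 16),
    ("I.Q. above 110", 110, 70),
    ("I.Q. 110 to 120", 76, 46),
    ("I.Q. 120 or above", 34, 24),
    ("feeble-minded (Waverly)", 349, 114),
]


def summarize(total, above):
    """Return (per cent above, per cent below, odds, side favoured)."""
    below = total - above
    pct_above = round(100 * above / total)
    pct_below = 100 - pct_above
    # Odds are stated as the larger count against the smaller one.
    odds = max(above, below) / min(above, below)
    side = "above" if above > below else "below"
    return pct_above, pct_below, odds, side


for label, total, above in GROUPS:
    pct_above, pct_below, odds, side = summarize(total, above)
    print(f"{label}: {pct_above}% above, {pct_below}% below, "
          f"{odds:.2f} to 1 {side}")
```

On these counts the feeble-minded group comes out at roughly 2.06 to 1 below the median, which agrees with the "2 to 1" stated in the text.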

“It appears significant that the group of mentally normal children 
with I.Q.’s from .80 to 1.10 should be distributed in the normal manner, 
that 71 per cent of the individuals with I.Q.’s below 80 should have 
anatomic indices below the norm, while 71 per cent of the individuals 
with I.Q.’s of 120 or above have anatomic indices above the norm. It 
is equally striking that the chances are 2 to 1 that a feeble-minded 
child in an institution will have an anatomical development below the 
norm while the same chances hold that children with I.Q.’s of 115 or 
above will have an anatomical development above the norm. The chances 
are seen to be about 2 to 1 throughout that an approximate relationship 
between anatomical and mental developments will hold for deficient, 
normal, or gifted children.” 

“The significance for education of this positive 2 to 1 relationship 
cannot be stated until the various groups have been examined as they 
mature, for how long development continues is a factor which deter- 
mines the final level as well as how fast the individual grows. Certain 
possible conclusions of significance can be noted, however. 

“An anatomical index greater than the median for an individual’s 
age means that the individual is more mature than is usual at his age, 
that he is nearer the end of his development, and, inferentially at any 
rate, that he will not need to continue growing as long as is usual in 
order to reach maturity. If there is a common causal factor in two- 
thirds of the cases between mental and physical growth then the infer- 
ence for mental growth is clear. It, too, may not continue as long as is 
customary. During growth, however, the child’s brightness is measured 
by the average and the final outcome may be misjudged. That is, an 
individual might have an I.Q. of 135 when judged by the average, but 
if his maturity level were taken into consideration we might find his 
true index of brightness to be 120. Is this not a possible explanation 
for the adult mediocrity [above mentioned] of some of our infant 
prodigies, and of some of our ‘star’ pupils of the elementary school who 
do not shine in later life? Of course, there are personality factors to 
be considered too, but the maturity factor does not seem to be without 
significance. 

“On the other hand, an anatomical index below the norm means an 
individual less mature than the average with the possibility of a longer 
continuance of both mental and physical growth than is usual. May not 
this account for the adult success of some of our 110 I.Q. children whose 
school accomplishment was not outstanding? May it not even account 
for some of the good solid citizens who were judged dullards in school? 
How often we hear it said, ‘I never thought that he would amount to 
much, but he has a fine job and seems to be making good.’ 

“As heretofore stated, repeated measurements of the same indi- 
viduals as they mature both physically and mentally are necessary before 
the significance of the maturity factor can be justly evaluated. For 
the great group of average people whose development both physically 
and mentally is near to the average the anatomical age has no signifi- 
cance other than to know that it is near the level normal to the indi- 
vidual’s chronological age. But for those children who deviate from 
the average to any considerable extent in either anatomical or mental 
development, to know the anatomical maturity as well as the I.Q. in 
terms of chronological age is to have an item of information of value 
to teachers and psychologists alike.” 


The same problem may be illustrated by a consideration of sex 
differences in anatomical development. Girls of 6 years of age are as 
far along in anatomical development as boys of 7½ years, and fifteen- 
year-old boys are not more than 6 months ahead of the thirteen-year-old 
girls in spite of their two-years’ superiority in chronological age. Girls 
are in general at each age about 18 months ahead of boys by this index 
from their entrance into school until maturity. 

If our argument in regard to the effect of maturity holds, girls, 
assuming, for the present, equality in native endowment, should be 
superior at every age to boys in school accomplishment. Mr. E. A. 
Lincoln, of our staff, has recently made a comprehensive survey of the 
literature on school tests in regard to sex differences. Altho there are 
occasional exceptions and irregularities, the results of school tests do 
show pretty uniformly a slight superiority of girls to boys in the speed 
and quality of silent reading, in spelling, handwriting, and in composi- 
tion, and speed and accuracy in the fundamental processes in arith- 
metic (with lack of agreement in regard to relative excellence in arith- 
metical problems or reasoning). The only clear cases of superiority 
of boys to girls are in certain tests of history, notably in the historical 
information tests. 

Furthermore, Mr. Lincoln notes: “Studies of progress through 
school indicate a superiority of girls in school accomplishment since 
they are less frequently retarded and more often accelerated, and there 
are greater withdrawals of boys. Girls are therefore working with 
boys who are somewhat older and somewhat more selected. This means 
that comparisons by tests cannot be taken at their face value. The 
superiority of girls in grade comparisons is really greater than is indi- 
cated by the actual central tendencies.” 

With all allowances, however, it must be recognized that the 
superiority of the girls is not as great as their relatively greater 
maturity would seem to make likely. Of course it is to be remembered 
that our argument is based on the initial assumption of equality in 
native endowment, and it would be possible to argue (altho I should 
personally hardly hazard it) that the native endowment is not equal, 
and that, since the girls are much more accelerated in development, this 
enables them to excel boys of their own age in learning but not by as 
much as they should if their native endowment were equal to that of 
boys. 

When we come to consider the results of the intelligence tests of 
boys and girls the problem appears even more complicated. The results 
of the Stanford-Binet tests as reported by Terman do show to be sure 
a slight superiority in the favor of girls at all ages. On the other hand, 
in the Dearborn tests of intelligence the boys are at most ages superior 
to the girls. Mr. Lincoln has also made these comparisons. I will 
quote briefly from the findings of his analysis which is based on “the 
results of some 3,400 tests given to the school children from the second 
grade thru the high school. Every child in these grades was given the 
examinations, and therefore there seems no possibility of any selective 
factor to invalidate the results. Selection was further guarded against 
by giving the examinations in three different communities in none of 
which there seemed to be any extraordinary conditions. That the chil- 
dren of these communities were about average mentally is indicated by 
the fact that the median I.Q.’s for the various ages did not fall below 
96 or above 104 in any of the communities.” 

“Examination of the data showed that for each of the ages from 
7 to 16 inclusive there were 100 or more cases of each sex, except that 
there were only 98 boys at the age of 16. 

“It appears that there is, at most ages, a slight difference in favor 
of the boys, but it is not large and is not constant, since the girls excel 
at 8, and 14 years, and the medians are practically identical at 9 and 11. 
All the medians appear to be reliable, as indicated by the probable 
errors.” 

If now we inquire why the intelligence tests fail to note the dif- 
ference in maturity between the sexes which seems to be clearly indi- 
cated by the anatomic indices, and if on the principle of “safety first”, 
we again assume an equality of the sexes in native endowment, we may 
first attack the validity of the tests. In the first place, we may note 
that the tests have for the most part been devised by men. If that 
consideration does not on its mere enunciation make much of an appeal, 
I would respectfully ask your attention to a discussion (in a recent 
monograph by the speaker and collaborators on “Form Board and Per- 
formance Tests of Intelligence” which it is not possible to review here) 
of the way in which the devisers of intelligence tests have chosen tests 
in which their own intellectual powers do not suffer by comparison. 

A second consideration may perhaps seem more important, namely, 
the extent to which success in the current intelligence tests depends on 
schooling. Burt, by means of partial coefficients of correlation, has esti- 
mated that perhaps two-thirds of the score in the Stanford Intelligence 
Tests is a result of school training. Gordon, in another very important 
English study, has shown a direct correspondence between the length of 
time spent in school and the test results. He has further shown in the 
case of certain gypsy and canal boat children, with little schooling, that 
the youngest member of the family is almost always the brightest, and 
the oldest the dullest. These children at about 6 years of age compare 
favorably with children who are starting in school, that is they have 
I.Q.’s between 90 and 100. When their older brothers and sisters are 
compared by means of the tests with children who have been a number 
of years in school (since the tests have been standardized on school 
children) they appear to have I.Q.’s from 90 to 70 or below. 

The bearing of these findings on the problem of sex differences may 
now be stated. Girls enter school at the same age as boys, progress in 
their studies and are promoted at only a slightly faster rate, and are 
therefore not able to take advantage of their relatively greater maturity. 
If one adds to the effect of schooling the general environmental influ- 
ences which treat each chronological age irrespective of sex, in about 
the same way, the argument appears stronger. (Inadvertently, the 
writer’s group tests stress school training somewhat less than the Binet 
tests and draw a little more on the general knowledge gained outside 
of school, and also possibly draw on the knowledge boys acquire more 
often than girls. This may account for the, after all, minor differences 
in findings between the tests.) 

Of course, someone else than the speaker may again decide to argue 
that girls are natively not quite as intelligent as boys, but that this 
lack is compensated for during the period of growth by their relatively 
greater maturity. 

A somewhat less argumentative case on which it is hoped that light 
may also be cast by the present investigation is that of racial differ- 
ences. I will review the matter very briefly: 

In one of the communities studied there are about equal numbers 
of Italian children, Jewish children, and children of an older “American” 
stock. The Jewish children rank highest in mental age and in intelli- 
gence quotients, then come the American children, and last the Italian. 
In anatomical development, as indicated by the stage of ossification of 
the carpal bones and expressed in terms of anatomical ratios, the Jewish 
children are far ahead of the other two racial groups. The Italian boys 
appear slightly more developed than the American boys, whereas the 
American girls are a little more developed than the Italian girls. The 
small difference between these latter groups may be affected by a wider 
sampling of cases, but the position of the Jewish group can hardly be 
changed. 

Taking for purposes of illustration this outstanding difference of 
the Jewish group, we should be inclined to argue that the superiority in 
intelligence was in part due to the greater anatomical and physiological 
development, that they are, in other words, simply further along in the 
stage of growth; that the slower growing American group will either 
develop for a longer period or that their growth will be more accelerated 
during its latter stages, e.g., at the adolescent period, so that at the 
completion of the period of growth the initial disparity between the 
intelligence of the two groups would disappear. This result would 
depend upon the duration of the period of growth and the amount of 
later acceleration. If both factors were operative the American group 
may in the end excel in intelligence. A member of another race might 
well be inclined to argue otherwise! A solution of such a problem must 
await the comparisons of the yearly increments in anatomical, physio- 
logical, mental, and scholastic development. 

The above illustration suggests a comment on the much-discussed 
differences in the intelligence of negro and white children. It would 
appear from the use of such intelligence tests as the Stanford-Binet 
that negro children are relatively brighter in the early years than in 
later years, that at school entrance at about the age of 6 they are the 
equals in intelligence of white children of the same social status, but 
that with each successive year they are progressively less bright as 
compared with white children. (See e.g., Arlitt’s study.) 

Gordon in the significant study above cited of physically defective 
canal boat and gypsy children of little or no schooling has made exactly 
the same finding. He has shown as above noted that at age 6 these 
children are on the average of normal intelligence with intelligence 
quotients between 90 and 100, according to the test, but that with each 
successive year the intelligence quotients decrease on the average until 
by the age of 10 or 12 they average .70. The older children in the 
same family are relative to their life-age less intelligent than the 
younger children. Racial differences do not enter at all into some of 
these groups. Gordon attributes the results to lack of schooling and 
social isolation and presents conclusive evidence in support of this con- 
tention. He finds positive correlation between the amount of school 
attendance and the mental and educational ratios of these children and 
a high positive correlation between their educational and mental ratios. 
The possible bearing of these results on the current comparisons of 
negro and white children is evident enough and shows the necessity of 
more careful control in such comparisons of the factors of training and 
environment. These various illustrations will suffice to show some of 
the problems and the possibilities of the present investigation. 

Each year’s work has produced by-products to the main investiga- 
tion which in themselves make the study worth while. Among the special 
groups being studied are the feeble-minded, the retarded, the accelerated, 
the exceptionally gifted, the psychopathic children, the children of 
foreign parentage, and finally twins. In concluding, I should like to 
read short descriptions of several pairs of twins (which have been pre- 
pared by Miss Mary Wentworth, research assistant in this investiga- 
tion). The study of twins, by repeated measurements, illustrates one 
of the interesting ramifications of this investigation. 


ROLAND AND RUSSELL D. 


“Roland and Russell D. are as alike in appearance as two peas in 
a pod. Bright, sturdy, normal, freckle-faced little fellows, of 8 years 
of age, full of energy and vitality, well nourished, well developed, in- 
telligent and socially coöperative, they promise much for the future. 
Each has thin hands, in contrast to his solid little body. The hands 
were held tightly clenched during the examination. Each sat on the 
exact edge of his chair and held himself tense, tho one had no chance 
to observe the other. Each had a good, straightforward attitude, excel- 
lent habits of thought, and a conscientious ambition to do his best. 

“The responses in the Stanford were very similar. In May, 1923, 
each made a mental age of 8-2, I.Q. 110. In January, 1924, Roland’s 
mental age was 9-6, Russell’s 9-2, I.Q.’s 116 and 112 respectively. In the 
Stanford of 1923, Russell passed inferior ball and field, which Roland 
barely failed; Russell passed comprehension (year 8) by 2 points 
out of 3, while Roland failed by 1 point out of 3. Russell failed similari- 
ties, which Roland passed. It is interesting to note here that 8 months 
later it was Russell who passed similarities (year 12) which Roland 
failed, each having passed these in year 8. Russell passed weights 
(year 9), while Roland succeeded in memory for digits and for sentences 
(year 9). 

“In the test of 1924 the only difference of note is in auditory rote 
memory for digits, in which Roland was superior to Russell in each test. 
This may be based upon the fact that Russell is more impulsive than 
Roland and a little more self-conscious, so that it is more difficult for 
him to sustain his attention. The teacher has noted this impulsiveness 
as the only real difference she can find between them. 

“Russell made higher scores in all group tests, altho there was no 
great difference between them. Russell is the more objective of the 
two boys and quicker in thought and in action; in work with his pencil 
he is more accurate. This may account for his superior score on the 
group tests. He is slightly taller than Roland and weighs a little less. 
There is a difference in anatomic age in 1922 of 3 months, which in- 
creased to 8 months in 1923. All other physical measurements are ap- 
proximately equal. Dental ages are the same. 

“We see here a pair of twins in whom the similarities are so marked 
and differences so few and so unimportant except in anatomic ages that 
we have no hesitation in calling them identical twins.” 
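The mental ages and I.Q.’s reported for Roland and Russell can be related through the usual ratio definition of the intelligence quotient, mental age divided by chronological age. The sketch below is illustrative only and is not part of the original report. The twins’ chronological age is not given in the text; the value of 98 months (8 years 2 months) used here is an assumption chosen because it is consistent with the January, 1924, figures. The helper names `to_months` and `ratio_iq` are hypothetical.

```python
def to_months(age: str) -> int:
    """Convert a 'years-months' age such as '9-6' to a number of months."""
    years, months = age.split("-")
    return 12 * int(years) + int(months)


def ratio_iq(mental_age: str, chronological_months: int) -> int:
    """Ratio I.Q.: 100 * mental age / chronological age, rounded."""
    return round(100 * to_months(mental_age) / chronological_months)


# ASSUMPTION: the chronological age is not reported in the text.
# 98 months (8-2) is a hypothetical value consistent with the
# January, 1924, mental ages and quotients.
ASSUMED_CA_MONTHS = 98

print(ratio_iq("9-6", ASSUMED_CA_MONTHS))  # Roland  -> 116
print(ratio_iq("9-2", ASSUMED_CA_MONTHS))  # Russell -> 112
```

With the assumed age, a difference of 4 months in mental age corresponds to the reported difference of 4 points in I.Q.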


ELAINE AND EILEEN M. 


“Here are two little girls of 7 who are so much alike in looks that 
their teacher can tell them apart only by the right eyebrow, which in 
one child has a slight irregularity. Their work in school is the same 
in every respect, their manner and attitude practically identical. They 
come from an average home. Their father is a stenographer. There are 
seven children, so that these twins have had no special training nor 
spoiling, but are normal and wholesome. Hair and eyes, general color- 
ing, and appearance are similar. One seemed a little shy, the other 
was more straightforward. This may have been one reason why Elaine’s 
rote memory was better than Eileen’s. 

“In physical measurements we find a difference of 2 mm. in height 
and of a pound in weight, all other measurements being practically the 
same. These variations are maintained over a year, as a comparison 
of the two sets of measurements indicates. 

“In 1922, there was a difference of 8 months in anatomic age, in 
favor of the taller child. In 1923, the difference was only 2 months. 
Dental ages were the same. In June, 1923, each had an I.Q. of 98 by 
the Stanford. Elaine failed right and left (year 6), diamond (year 7), 
and definitions (year 8) by 1 point, all of which Eileen passed. The 
latter failed on comprehension and memory for sentence (year 6), and 
picture description (year 7). 

“In 1924, Elaine failed counting 20-1 (year 8) and definitions 
(year 8), which her sister passed, while Eileen failed coins (year 6), 
repeating 5 digits forward and 4 backwards. Elaine did better each 
year in the Dearborn Group Tests by about 8 points. In the arithmetic 
and Detroit tests they were in the same percentile. 

“Elaine has more self-assertion and is less subjective in her atti- 
tude. She is apparently the leader of the two. This difference in per- 
sonality will have its effect on any outward expression, especially in 
group work involving competition. It may easily account for the higher 
score of Elaine in the group tests even tho Eileen showed greater skill 
in the use of the pencil in the individual test. 

“Differences between these twins are far outweighed by similarities 
and they can certainly be classified as identical twins. It is quite pos- 
sible that the differences in personality already expressing themselves 
may be due to environmental influences and may become exaggerated 
in the future so as to cause more apparent differences in behavior. 
Continued study of these twins over a period of years will be of much 
value.” 

HARRIET AND HELEN M. 


“This is a case of such marked difference in personality make-up, 
emotional reactions, and intelligence that one could not know they were 
twins. General coloring and complexion, hair and eyes are alike. Here 
the resemblance ceases. 

“Helen is a charming child whose smile bewitches everyone. She 
is sociable, affable, instantly popular, extraverted, interested in every- 
thing, normal, wholesome, natural, average in intelligence, quick, and 
impulsive. 

“Harriet is self-conscious, sober, not naturally sociable, not popular, 
intraverted, interested in ideas, abnormal, unnatural, grown-up, superior 
in intelligence, careful and methodical in thought. 

“They come from a good average home. There are two older chil- 
dren. The mother is very conscientious and eager to do the best for 
her children. She is at present much worried over Harriet. She says 
she has always been different and a source of anxiety. Harriet is one 
year above her sister in anatomic age. This may account in part for 
her mental maturity. 

“In February, 1923, the Stanford-Binet pointed to a considerable 
likeness in intelligence, both quantitative and qualitative, such as would 
indicate a pair of identical twins. On the other hand, there was a dif- 
ference of 24 points in the group test, partly due to the fact that Helen 
was more careless, had a poorer memory, and was more impulsive. She 
could have done better than she did. 

“In January, 1924, nearly a year later, Harriet shows herself 
superior both quantitatively and qualitatively in the Stanford-Binet. 
Her arithmetic score is in the highest percentile in the group while 
Helen ranks as low average. The psychographs are markedly dis- 
similar and in no way indicate identical twins. 

“Thorndike’s conception of a ‘social’ intelligence apart from an 
abstract and mechanical finds illumination here. The social adaptation 
of Helen is evidently due to traits and characteristics which are ap- 
parently only in small part intellectual. The attempt of Harriet, due 
to the lack of these traits, to equal the social success of her sister 
results in what is in part an acquired superiority in abstract intelli- 
gence, i.e., is due to compensation. The difficulty of making a division 
between the effect of inheritance and environment is thus again illus- 
trated. The present difference which the tests show in intelligence is 
evidently in good part due to environmental influences, the result of 
having a pretty sister who is constantly a center of admiring glances 
creating in her the struggle to win social commendation in the easiest 
way she can. 

“This is a good example of the advantage of repeated observations 
and measurements and of the necessity of interpreting the test findings 
in the light of knowledge of the personality and general behavior of the 
individual.” 


The significance of these descriptions of the resemblances and dif- 
ferences in twins will be increased with each added year of observation 
and testing. The changes in twins which have taken place, in some 
instances within a year period, show that what appear superficially to 
be the same environmental influences may lead to individual variations 
or differences in mental characteristics as important as any which may 
be attributed to heredity. 














Reliability and Uses of Group Tests of 
Intelligence 


WALTER F. DEARBORN, Psycho-Educational Clinic, Graduate School of 
Education, Harvard University 


In presenting the subject of this evening’s discussion, I wish to 
reverse the order of topics and speak first of some of the uses and then 
of the reliability of group tests of intelligence. In speaking of reliabil- 
ity, I should like also to refer briefly to what is now commonly referred 
to as the validity of the tests. In discussing the uses of the tests I shall 
assume your acquaintance with the more usual findings and shall call 
attention, for the most part, to the employment of the tests in the study 
of problems which until the advent of the group tests had either not 
been recognized as such, or not adequately attacked. The data to be 
presented have been secured chiefly from the use of my own group tests 
of intelligence. 

The special province of the group test would seem to be where the 
number of cases necessary for an adequate study is too great to admit 
of individual examinations. Whatever may appear later in regard to 
the reliability of group tests for determining the intelligence of indi- 
viduals, they are entirely adequate for the comparison of classes of 
children. The following studies will serve as illustrations of such uses 
of the tests. 

In a number of community surveys we have examined, sometimes 
within a day or two, every child in school from the first grade thru the 
high school. The comparison of findings brings to light a number of 
interesting differences in the grade classifications of pupils. When 
shown graphically, a bird’s-eye view may be secured of this important 
problem. 

The first chart shows in its upper half the chronological age-grade, 
and in the lower half the mental age-grade classification in the first, 
second, fourth, sixth, and eighth grades of a city in the neighborhood 
of Boston. The third, fifth, and seventh grades have been omitted in 
order that the chart be not too confusing. The differentiation between 
grades is seen here, as in many communities, to be more a matter of 
chronological age than of mental age. In the mental age distributions, 
the variability increases with each advancing grade, and after the third 
grade the overlapping is so great that, except for the gradual advance 
of the average or median mental age, differentiation is difficult. 

The grade differentiation on the basis of mental age is somewhat 
better in a second and socially more favored community, adjacent to 
the first. The first, third, fifth, and sixth grades only are here shown. 

The third series of charts shows the chronological and mental age- 
grade distributions for three communities which have been here com- 
bined. All of these communities have been regarded, as the result of 
special studies, as fairly average or typical New England towns. The 
distributions are shown for grades two, four, six, eight, ten, and twelve. 

The variability in mental ages which may be noted in different 
grades, in these various charts (in one city, e.g., the variability of the 
sixth grade is much greater than in any other sixth grades studied) 
raises the question as to what may be considered as normal, in the sense 
of common practice, in our schools. In general, variability in mental! 
age should increase with advance in grade just as with advance in age, 
but this general factor is somewhat covered up by the greater amount 
of retardation in the early grades and the greater extent of elimination 
in later grades especially after the sixth grade. 

In the combined grades of these communities last mentioned, the 
range in mental age of the middle 50 per cent of pupils in each grade 
beginning with the second is as follows: 11 months, 16 months, 23 
months for the fifth and sixth grades; 28 and 29 months, seventh and 
eighth grades; and an average of 24 months or two years for the high 
school classes. These irregularities are not so apparent when the com- 
parison is made in terms of intelligence quotients, and we find that a 
class is, according to current practice, considered sufficiently homogen- 
eous for purposes of instruction at least half of whose members differ 
in either direction from the average by not more than, approximately, 
10 points in intelligence quotients. Some such measure would evidently 
be useful to a superintendent of schools in considering the problems of 
grade classification. 
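The working rule just stated, that a class passes as sufficiently homogeneous when at least half of its members lie within about 10 points of the class average, may be put as a simple check. This is a sketch only: the function name `is_homogeneous`, the use of the arithmetic mean as the "average", and both class lists are assumptions made for illustration and are not data from the surveys.

```python
from statistics import mean


def is_homogeneous(iqs, tolerance=10):
    """True if at least half the class is within `tolerance` points of the mean."""
    average = mean(iqs)
    within = sum(1 for q in iqs if abs(q - average) <= tolerance)
    return 2 * within >= len(iqs)


# Hypothetical classes, invented purely to exercise the rule.
compact_class = [88, 95, 97, 101, 103, 104, 108, 112, 120, 131]
spread_class = [70, 78, 85, 92, 100, 108, 117, 125, 133, 142]

print(is_homogeneous(compact_class))  # True: 6 of 10 within 10 points
print(is_homogeneous(spread_class))   # False: 2 of 10 within 10 points
```

The tolerance is left as a parameter because the text gives the figure of 10 points only as an approximation.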

In the elementary schools of neighboring Massachusetts towns we 
have found the median intelligence quotients to vary from 85 to nearly 
115. Much more should evidently be expected in school accomplishment 
in some of these schools than others, but the results of school tests 
show that this is not always the case. It has frequently been shown 
that within the same school system, advantage is often not taken of 
these differences in intelligence, and the same appears to be true as 
between different communities. 

I wish to call attention to some schools which are particularly for- 
tunate in their selection of pupils, and of whom therefore much should 
be expected in the way of accomplishment. 

To give a standard for comparison the first graph in this series 
(Chart IV, bottom graph) shows the distribution of intelligence quotients 
in the three communities above mentioned in which rather average or 
typical conditions have been found. All children from the second grade 
thru the high school are included, 3,623 cases in all. The median I.Q. 
is 103; Q1, or the 25th percentile, = 91; Q3, or the 75th percentile, = 114; and 
Q = 11. The range of the middle 50 per cent of cases is thus from 
91 I.Q. to 114 I.Q. The total range of cases is from 50 I.Q. to 160 I.Q. 
The second graph from the bottom shows the selection which takes place 
within one of these school systems, being the distribution of the intel- 
ligence quotients of 275 pupils in one of the high schools. The median 
I.Q. is 114, which is at the 75th percentile of the total distribution; Q1, or 
the 25th percentile, is 103, which is at the median of the total distribu- 
tion; and Q3, or the 75th percentile, is 122 I.Q. Q = 9. The selection 
of the high school pupils is such, therefore, that they may be said, with- 
out inquiring too carefully in regard to possible zero points, to be about 
25 per cent more intelligent than the total group. (Their superiority 
to an unselected group of pupils is really somewhat greater than is 
shown by the comparison. If we take as a group all of the nine-, ten-, 
and eleven-year-old children in these towns, and at these ages practically
all of the children are in school, the median I.Q. is 100 ± 10
instead of 103. The above combined group is, in other words, itself
somewhat “selected” since it includes the upper grade and high school 
children.) 

The next graphs (the middle graph and that next to the top of 
Chart IV) show the distribution of intelligence quotients in two private 
academies, one a school for girls and the other a school for boys, and 
the last graph at the top of Chart IV, the distribution of a public Latin 
school. Altho these schools include pupils of the last six grades instead 
of the last four as in the high school, they are still higher in average 
intelligence. The median I.Q. in the girls’ academy is 116, in the boys’ 
118, and in the public Latin school 124. In the latter school, where 
there is of course an unusual selection, the median child is above the 
75th percentile of the average high school, and less than a sixth of the 
school are below the median of the high school group. With one excep- 
tion, they are all above the average of the unselected group of the 
average communities, and 90 per cent are equal or superior to the
best 25 per cent of the unselected group. 

I have had an interesting correspondence with an excellent teacher 
of this Latin school who wrote me that it was his impression that 
“These results show a higher standard of intelligence than the boys’
school work would seem to indicate.” I should like to quote from my 
reply to his letter, as it expresses what I had in mind to say here in 
regard to the problem of judging, in the absence of specific tests of 
school accomplishment, whether pupils in such favored schools are living 
up to their possibilities: 

“Our judgment in regard to the relative superiority of your stu- 
dents to the general average is based on a number of facts. I can only 
mention a few of them. A study made a few years ago by Mr. Lincoln of 
our staff showed that the work in college of the graduates of your school 
was equal to that of the average of the college group in the first two 
years, and that there was a marked improvement of their grades in the 
last two years. At the time his study was made it was found that 90 
per cent of your graduates had entered college and that 43 per cent 
were continuing their studies five years after graduating from the Latin 
school, most of them presumably doing work for advanced degrees. 
These figures are significant in connection with current estimates that 
at the time the study was made only 50 per cent of the public school 
graduates were going to the higher institutions of learning, and only 
33 per cent were entering college. 

“I would interpret the present findings in regard to the I.Q. as 
indicating that you are maintaining much the same standards of selec- 
tion in the case of your pupils as formerly. 

“Proctor, in a recently published study of results in California 
schools, finds that the average I.Q. of high school freshmen is 105; that
of high school graduates, 111; and that of college freshmen, 116. He
estimates that an I.Q. of 120 may be expected of college graduates. 
Similar records which we have obtained showed a median I.Q. of 125 
for Harvard freshmen, 133 for advanced students, and 137 for those 
doing graduate work. These figures are on the basis of an average 
adult mental age of 14½. On the usual basis of calculation, namely
16 years, the figures would run 113, 120, and 124. These are the groups 
in which your students are holding their own, and to do this they must 
have correspondingly high I.Q.’s. Your results are also calculated, as 
you know, on the assumption of an average adult mental age of 14½.
The I.Q.’s would be decreased after the eighth grade by calculating on
the usual basis (Terman and others) of 16 years. This, incidentally,
is one of the reasons why we believe 14½ is nearer right for the average
adult mental age. 

“Further, the findings as to I.Q.’s in nearly all the group tests show 
higher results than the individual Binet Tests. This I believe to show 
that the standards of the Binet are too high in the upper years, because 
group tests are about as accurate as individual tests for determining 
group averages. 

“The bearing of this last comment is this, namely, that we have 
hitherto been accustomed to judge in regard to the frequency of high 
I.Q.’s from our experience with individual tests.

“Finally, as regards your impression that ‘these results show a 
higher standard of intelligence than the boys’ school work would seem 
to indicate’ I would make these comments: First, it is possible that the 
students are not living up to their possibilities; but, how can you com- 
pare the accomplishments of your students with what average students 
would do? One does not find average groups after the seventh grade, 
and, therefore, it is difficult to ascertain what the average pupils would 
do in the studies of the ninth to the twelfth grades. The only way it 
seems to me you could justify your impression would be to compare
what your students did in the sixth and seventh grades in the schools 
from which they came with the average accomplishment of those grades. 
If they, as a group, were then as superior to the grade average as their 
present I.Q.’s would indicate they might be, I doubt whether you would 
be justified in your impression. The presumption would be that they 
would maintain their relatively superior attainments in later years.”

The problem of the average mental age, to which reference is made 
in the letter above quoted, is well illustrated in the median I.Q.’s in this 
school. If the I.Q.’s are calculated on the basis of an average adult
mental age of 16, the grade medians decrease with each advance in 
school grade as follows: seventh grade, 125; eighth grade, 123; ninth 
grade, 119; tenth grade, 117.5; eleventh grade, 110; and the twelfth 
grade, 105. Such a result is quite impossible. Calculated on the basis of 
an average adult age of 14½ there is but little difference in the lower
grades, but the three upper grades average, as would be expected, about 
6 points higher than the three lower grades. 
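
The conversion between the two bases quoted in the letter can be sketched in a few lines. This is an illustration only. It assumes that, for subjects past the adult limit, the I.Q. is the mental age divided by the assumed average adult mental age, so that a change of divisor merely rescales the quotient. The function name rebase_iq is hypothetical, and the results are truncated to whole points because that reproduces the figures given in the letter (113, 120, and 124).

```python
def rebase_iq(iq, old_adult_age=14.5, new_adult_age=16.0):
    """Rescale an adult I.Q. from one assumed adult mental age to another.

    Assumption: I.Q. = 100 * mental_age / adult_age for these subjects,
    so the mental age itself is unchanged and only the divisor differs.
    """
    return iq * old_adult_age / new_adult_age


# Median I.Q.'s quoted in the letter, computed on the 14 1/2-year basis.
harvard_medians = {"freshmen": 125, "advanced students": 133, "graduate work": 137}

# int() truncates toward zero; plain rounding would turn 120.53 into 121.
rebased = {group: int(rebase_iq(iq)) for group, iq in harvard_medians.items()}
```

The same rescaling, applied only to pupils old enough to be affected by the adult limit, is what lowers the upper-grade medians when 16 years is taken as the basis.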

Finally, I wish to call attention, in this series of charts, to the dis- 
tribution of I.Q.’s in the first three grades of a suburban community 
(Chart V). The median I.Q. is 114. This group of early grade children
is, therefore, as select a group as the average high school class.
It would be interesting to inquire whether they are showing equal 
superiority in school attainments, but this would take us too far afield 
from the objects of the present discussion. 

In further illustration of the uses of group intelligence tests, I will 
mention briefly the findings of two recently completed doctorate theses. 
Dr. L. Thomas Hopkins in an investigation, the results of which are 
now being published in the Harvard Contributions to Education under 
the title of The Intelligence of Continuation School Pupils in Massa- 
chusetts, has examined about 1,200 continuation school pupils in five 
Massachusetts cities and towns and compared them with about 2,000 
pupils of the same ages in the regular schools of these communities. 
Different types of racial and industrial centers are represented. Two 
of the communities have a fairly homogeneous racial group and few 
industries employing a high type of skilled workmen (automobile body, 
boot, and shoe); a third city is very heterogeneous as to race and diversified
as to industries; the fourth is a typical cotton mill city; the fifth
city is in the metropolitan area, and differs from the others in including 
pupils in its continuation school from a number of surrounding com- 
munities. The findings in these varied communities were much the 
same: “The intellectual development of continuation school pupils is 
on the average about 2½ years less than that of the regular school
pupils of the same age; about a quarter only of the continuation school 
children exceed the 25th percentile of the regular school distribution 
(in terms of mental ages), and a smaller proportion reach the median 
or average age of the latter groups.” 

Not all of the difference between these groups can be attributed to 
heredity. It is in part the effect of schooling. The intellectual develop- 
ment of some (how many no one knows) has suffered because the schools 
have not provided the right sort of training for them. The usual 
academic training has failed where there is good reason to believe train- 
ing of a different sort might have succeeded. Pre-vocational and trade 
classes would have helped, but admission to such classes usually depends 
on the completion of at least the work of the sixth grade, and that is 
just what these children have not been able to do. With the remarkable 
increase in recent years in the enrollment of other junior and senior 
high schools, the enrollment of the trade and technical high school has, 
in many communities, remained static. If these schools would come 
down from their high places and suit their instruction to the needs of 
this group, their ranks might be filled. The sciences basic to the art of 
plumbing are numerous, but they should not estop a plumber’s helper 
in the making. 

Whatever the merits of this suggestion—and the writer is aware of 
some of its complications—here is a group of children at least half of 
whom have been seriously retarded because of this inability to learn 
well what is ordinarily taught in the fifth and sixth grades. Is there 
no other alternative to keeping them in these and lower school grades 
until they are 14 years of age? And—should not these considerations 
give pause to those who, without providing measures to meet existing 
difficulties, advocate the keeping of these children in school, by legal 
measures, for a year or two longer?

The second doctorate study to which I have referred as an illustration
of the uses of group intelligence tests concerned the question of
possible handicaps under which children of foreign parentage may be 
in our schools. Educators have commonly minimized the importance of 
this question and have said that the trouble was more a general intellectual
one than a linguistic one, and that, if foreigners were of average
intelligence, a few years in school would be sufficient to overcome any
language handicap. The findings of the present investigation by Dr.
Fick do not support this rather complacent opinion. Previous studies 
have been open to criticism in that they have either not established the 
possible differences in intelligence or have not adequately determined the 
language difficulties. 

The Army Beta Test was used in this study and was given without 
any verbal instructions to children in the third grades of three types 
of schools: (1) practically 100 per cent Americans; (2) about equal 
number of children of foreign and native parentage; and (3) practically 
100 per cent foreign. At the ages common to these grades, it was found 
that the foreign children were on the average a year behind the Ameri- 
can children in the test scores. Two reading tests (the Ayres Burgess 
and the Thorndike-McCall) were then employed to gauge the language 
ability. “The intelligence factor was kept constant by selecting for 
comparison only foreign and American children of equal intelligence. 
This was done by selecting all children of Army Beta Intelligence scores 
between 50-59 for comparison in language ability, and all children be- 
tween 60-69 for comparison in language, and likewise all children be- 
tween 70-79.” 
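
The matching procedure described in this quotation can be pictured with a short sketch. It is an illustration only: the records below are invented placeholders and are not data from the study, and the names band and compare_within_bands are hypothetical. The sketch assumes that each pupil is reduced to a parentage label, an Army Beta score, and a reading score, and that comparison is made only between pupils falling in the same ten-point Beta band.

```python
from collections import defaultdict
from statistics import mean


def band(beta_score, width=10):
    """Return the (low, high) score band containing beta_score, e.g. (50, 59)."""
    low = (beta_score // width) * width
    return (low, low + width - 1)


def compare_within_bands(records, bands=((50, 59), (60, 69), (70, 79))):
    """Mean reading score per (band, group), ignoring pupils outside the bands.

    records: iterable of (group, beta_score, reading_score) tuples.
    """
    cells = defaultdict(list)
    for group, beta_score, reading_score in records:
        b = band(beta_score)
        if b in bands:
            cells[(b, group)].append(reading_score)
    return {key: mean(scores) for key, scores in cells.items()}


# Invented placeholder records, NOT figures from the investigation.
records = [
    ("native", 52, 30), ("foreign", 57, 22),
    ("native", 63, 34), ("foreign", 66, 25),
    ("native", 75, 40), ("foreign", 71, 31),
    ("native", 85, 45),  # outside the three bands, so left out
]
result = compare_within_bands(records)
```

Holding the band fixed in this way is what allows a difference in reading score to be read as a difference in language ability rather than in general intelligence.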

The findings were as follows: 

“At each level of intelligence the foreign children were found far 
inferior to the American children in language ability. 

“It was also found that the American children instructed in the 
mixed schools with foreign children were inferior to American children 
in a separate school at each level of intelligence. This may indicate 
that the foreign children impede the progress of the American children. 
Of course such factors as better methods of instruction in the separate 
school or better home environment may have influenced these later 
results. 

“In the case of foreign children it was found that those instructed 
in the mixed schools together with American children were better off 
at two of the levels of intelligence, as compared with the foreign children 
in the school composed only of foreigners.” 

“It was also shown that the initial language handicap did not de- 
crease in the case of the foreign children in comparison with the Ameri- 
can children, as the groups progressed through the grades. 

“The inferiority on the part of the foreign children in the two 
silent reading tests in comparison with American children at every level 
of intelligence indicates that the foreign children are handicapped in 
receiving their instruction through English as compared with American 
children. This is a safe deduction as the reading test measures an 
excellent replica of the medium of instruction.

“Furthermore, the fact that the foreign children do not on the
average show a tendency to overcome the initial obstacle or handicap 
enhances the seriousness of such a handicap greatly. Such a fact, too, 
may justify the emphasis placed on instruction through the home lan- 
guage in the initial stages of education.” 


These above-described studies illustrate some of the current uses 
of the group intelligence test. In speaking finally of the reliability of 
these group tests, I wish to make certain comparisons with the indi- 
vidual intelligence test which may also bear upon the relative validity 
as well as reliability of these two kinds of tests. 

The group test may lead to different results from the individual test 
for at least two reasons: first, because it is a group test, that is, it is given
to groups rather than to individuals; and, secondly, because it may differ
qualitatively from the individual test. These two factors need to be 
carefully separated in the discussion of the problem. 

It has been our experience that the first cause of difference may 
be greatly lessened if somewhere nearly as much attention is given to 
the technique of group testing as has been given to that of individual 
testing. It is a different skill and not easily mastered by many testers. 
As to the second point, it was, in my own case, the intention to make 
a less verbal test and to sample a somewhat different lot of abilities 
from that of the Binet examination, and, therefore, somewhat different 
results were anticipated. The extended use of the individual Binet Test 
has, however, made it, with due recognition of its limitations, a standard 
of comparison, and the relative reliability as well as validity of the 
group test may be best appraised by comparing it with that of the 
individual test. 

Otis has proposed a method of determining the reliability of the 
Binet, by division of the tests of each year group into halves. Two 
mental ages are then figured according to the number of tests passed 
in each half of the scale. The difference between these two mental ages 
is the basis of his measure of reliability which he calls the probable 
error of the single score. He finds this to be ±6 months for adults.
Dickson using the same method found the probable error of the single
score to be 3 months in the case of first grade children. Assuming a
median mental age of the adults examined to be 14 and that of the first 
grade children to be 7, these two determinations agree on a probable 
error in terms of I.Q. of 3.57 points. This means that the error of a 
single examination with the Binet will be 6 points or more in one 
quarter of the cases, 9 points or more in one case in 10, and 14 points 
in one case in 100. 
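
The arithmetic behind these statements can be checked with a brief sketch. It is offered as an illustration under one stated assumption: that errors of measurement are normally distributed, so that one probable error corresponds to about 0.6745 standard deviations. The function names are hypothetical.

```python
from statistics import NormalDist

# One probable error (P.E.) is the deviation exceeded by half of the cases,
# which for a normal distribution is about 0.6745 standard deviations.
PE_PER_SIGMA = NormalDist().inv_cdf(0.75)


def pe_in_iq_points(pe_months, median_mental_age_years):
    """Express a probable error in months as a percentage of the mental age."""
    return 100 * pe_months / (median_mental_age_years * 12)


def share_exceeding(error_points, pe_points):
    """Share of cases whose error is at least error_points in either direction."""
    sigma = pe_points / PE_PER_SIGMA
    return 2 * (1 - NormalDist(0, sigma).cdf(error_points))


pe_adults = pe_in_iq_points(6, 14)      # Otis: 6 months at a mental age of 14
pe_first_grade = pe_in_iq_points(3, 7)  # Dickson: 3 months at a mental age of 7
shares = {points: share_exceeding(points, pe_adults) for points in (6, 9, 14)}
```

Both determinations come to about 3.57 points, and under the normal assumption the shares for 6, 9, and 14 points are roughly a quarter, a little under one in ten, and a little under one in a hundred.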

The added findings which we present for discussion are shown in 
Tables I and II. Otis’s results are given in the first column for com- 
parison. In the second column Otis’s method of halving the Binet Test 
is applied to the cases of 145 first grade children whom we have recently 
tested. The correlation of intelligence quotients (as determined from 
the halves of the test) is somewhat higher than Otis found and the 
reliability as measured by the P.E. of the single score (3.31) somewhat 
greater. The Probable Error of Measurement (Formula B of Monroe) 
is 3.25. These 145 children were retested at short intervals (on the
average about 3 months later), and as a second measure of reliability,
the intelligence quotients on the first (total) test were compared with 
the intelligence quotients secured on the retest. The results are shown 
in column 3. The probable error of the single score is somewhat less 
(2.56) and the P.E. of Measurement (4.04) somewhat larger than by 
the first method.

If now we treat the group test—in this case the Dearborn—as if 
it were an alternative test to the Binet, and in this way measure its 
reliability, we secure practically the same results as secured by Otis 
and ourselves with the Binet. In column 4 the comparison is between 
the intelligence quotients of Dearborn General Examination C and those 
of the Stanford-Binet in the case of 210 high school pupils; in column 5 
the three measures have been calculated for 75 cases (chiefly border- 
line or mentally deficient) recently reported by Frank N. Freeman 
(Journal of Educational Psychology). 

In column 6 the results are based on the testing of 211 first grade 
children with the Stanford-Binet and the Dearborn Group Intelligence 
Test, Series I, as recently reported by Dr. Gertrude Rand (Journal of 
Educational Research, Vol. 9, No. 3, March, 1924). The findings in 
these last three comparisons (columns 4, 5, and 6) are in close agree- 
ment with those of the first three (columns 1, 2, and 3). This agree- 
ment between the individual and group test argues that the validity as 
well as the reliability of the tests is similar. 

Since by retesting with the complete Binet we find the same reliabilities
as by the Otis method of comparing the halves of the Binet, the
method of retesting with the group test at varying intervals has also
been applied as a measure of its reliability (Table II). The findings as
to reliability by this method are not materially different from the above
findings.

TABLE II

                                     1                        2

Comparison                           Dearborn Series I 1922   Dearborn C April 1923
                                     Dearborn Series I 1923   Dearborn C May 1923
Number and kinds of pupils           478 first and second     Fourth to sixth grades
                                     grades
Correlation (r)                      +.84                     +.92
P.E. of single meas.
P.E. of meas., Formula B (Monroe)    4.29                     3.97

The Dearborn Group Intelligence Test, Series I, was given to a 
group of 478 first and second grade children in May, 1922, and was 
given again to the same children a year later, May, 1923. The correla- 
tion between the intelligence quotients on the two tests was +.84 and 
the P.E. of Meas. 4.29 (column 1, Table II). The probable error of
a single measure (4.29) is in this case only approximate, and is somewhat
too large since it has been calculated without correcting for constant
errors. This correction has been made in the case of all the other
probable errors reported, by the use of Otis’s graphical method for trans- 
lating the scores on one scale into equivalent scores on the second scale. 

Dearborn General Examination C (of Series II) was given to all
fourth, fifth, and sixth grade children in one school in April, 1923, and 
repeated on the same children after a single month’s interval. The 
results are shown in column 2 of Table II. 

The explanation for these findings we do not attribute altogether 
to the character or merits of the particular group test used. Rather 
we attribute them in good part to the fact that the examiners have been 
well trained and have mastered the art of giving group tests in the 
same way that the Binet tester has found it necessary to do. It has 
been suggested that anyone with little or no training can give the group 
tests. We have not found this to be the case, if one cares about reliable 
results. The handling, particularly, of the elementary school grades as 
groups for testing purposes requires a skill which only comes with 
much practice. The examiner must learn to recognize whether he can 
handle a given class as a group or must subdivide it. He must have
a certain, not group consciousness, but consciousness of the group. 
Under such conditions we believe that the results of the group tests may 
be made nearly if not quite as reliable as the individual, and the above 
findings bear out this statement. 

Finally, it may be well to call attention again to the extent of the 
reliability or unreliability above determined. The P.E. of a single meas- 
ure indicates that half of the cases will vary in any given test by 
about 3½ points of I.Q. from the true measure of their intelligence (as
far as the tests test it). A quarter will vary to the extent of 6 points 
or more, one case in 10 to the extent of 9 points or more, and one case 
in 100 to the extent of 14 points or more. If we use the P.E. of Measure- 
ment as an index of reliability these figures will usually be increased by 
a point or two. This shows the hazard, which was so well brought out 
this morning by Dr. Woody, in the case of subject-matter tests, of relying 
on a single measure of intelligence.



Special Disability in Learning to Read


WALTER F. DEARBORN, Psycho-Educational Clinic, Graduate School of 
Education, Harvard University 





WITHIN the last twenty-five years a considerable number of children 
have found a place in the writings of neurologists and ophthalmologists 
because of their failure to learn to read under capable instruction or 
because their difficulties in learning to read were so extreme that they 
were considered pathological. Since these cases were rare and were 
thought to be due to congenital, and therefore, by implication, largely 
irremedial, factors, they have remained until recently largely the con- 
cern of the physician. 

It now appears that while such extreme cases as were described 
by the term “alexia” are rare, it is not so unusual to meet with cases 
approaching them in difficulty (better called “dyslexia”) and that much 
can be done by education not only for the latter group, but also for 
the former. Indeed the successful instruction of these children makes 
it clear that if there are congenital factors it is no longer sufficient to 
say, as for example Hinshelwood does, that the defects in question are 
localized in the supramarginal and angular gyri of the brain in which 
are deposited the visual memories of words and letters. A more exact 
analysis of the nature of any such defective inheritance is needed. 
Further, in many cases the question arises whether the difficulties are 
not at least quite as much matters of acquirement as of inheritance. 
The writings of Hinshelwood himself, of Bronner, Grace Fernald, 
Keller, Leta Hollingsworth, Freeman, Fildes, Gray, and others have 
already shown these latter possibilities and promise the educator a 
suitable return for his investment of interest in these children. 

Within the last five or six years we have in our Psycho-Educational 
Clinic discovered about twenty-five such children. In possibly only 
one of these cases was the disability so great as to warrant classifica- 
tion as “alexia” or word-blindness. We had for some years considered 
this case of a boy as our one clear illustration of this condition. The 
fact that he has, subsequently to our first examination about six years 
ago, and to a short period of intensive training which we then gave him, 
become a capable reader, has led us to challenge the correctness of this 
earlier diagnosis. I shall refer to this case later. 

My own interest in these cases is first a practical one; to discover 
what can be done for them individually, and to note whether the special 
work and methods used illuminate at all the general pedagogy of read- 
ing; and secondly, for the significance which an analysis of these cases 
may have for general psychological theory. I should like first to say 
a word in regard to this latter question. 

We have for long been suffering from a bad habit in psychology 
of attributing to heredity what we have not otherwise been able to 
account for. Instead of explaining anything, this way of disposing of
the matter has usually for a time stopped further inquiry. We have,
for example, in psychology long lists of instincts, most of which, now 
that we know a little more about the mechanisms of learning, appear
to be, as far as inheritance goes, mythical entities. So in the cases
of special reading disabilities, family “trees”, like those in the case
of feeble-mindedness, have been described extending back into the fourth 
and fifth generations. It would thus appear that we had to do with a 
form of inherited defect, the location of which was said to be in certain 
“brain centers”. The study of these cases has, thus, general theoretical 
interest because it would appear that instead of the inheritance of any 
vaguely, if centrally, located blight—such as might impede the develop- 
ment of association fibers in the appropriate centers of the brain and 
thus result in a general reading disability or word-blindness—we find 
a number of minor sensory or perceptual defects and motor difficulties, 
differing in different individuals, which may interfere somewhat with 
a given method of learning to read but perhaps not with another. These 
defects may also make any method of approach a little more difficult 
than it would otherwise be. We have further found that in all but one 
or two of our cases these defects have been associated with a certain 
nervous instability, or, in more objective terms, a lack of proper bring- 
ing up and discipline in the home, so that the traits of character, the 
incentive or interest sufficient to overcome the above-mentioned handi- 
caps are lacking. Burt, in a paper on “Unstable Children” published 
in 1917, notes that neurotic children are often deficient in reading, tho 
intelligent, and Hollingsworth in commenting on Burt’s observation, 
puts the matter thus: “This follows from the psychology of the mechan- 
ics of reading. Mastery of these mechanics calls for an ordinary degree 
of coöperation, adherence to definite directions, power of sustained effort,
and fidelity to bare facts. Neurotics are those who are characteristically 
inferior in these essential qualities, among others. Where impulsive re- 
sponse, negativistic attitude, flightiness, and illusion cause failure, 
neurotic children fail. Hence many of them never learn to read, except 
by individual teaching.” 

A brief review of the literature beginning with the work of Hinshel- 
wood will be sufficient to illustrate the older point of view and the extent 
to which one or the other or both of the above-mentioned factors appears 
in the description of cases. 

We shall limit the term “non-readers” to children of normal or
superior intelligence who, under capable instruction but for the most 
part in classes, have not been able to learn to read. 

Hinshelwood (Congenital Word-Blindness, 1917) limits his discus- 
sion to cases of extreme gravity in which the difficulty or inability is 
not associated with or complicated by other evidence of intellectual de- 
fect and considers the condition due to “a lesion in the left supra- 
marginal and angular gyri in which are deposited the visual memories 
of words and letters”. He believes that with perseverance and indi- 
vidual attention these word-blind children can be taught to read. As 
to methods he recommends frequent repetitions of the visual impression 
thru short, well-graduated reading lessons, and prefers the old-fashioned 
method of learning the alphabet and spelling out the words to the “look 
and say” method.

Bronner finds no evidence of alexia or general defect, notes the
possibility of the original difficulty being in poor visual or auditory 
“powers”, or where the separate processes are good, in a failure to 
“synthesize”. Gray’s analysis of causes includes “inability to analyze 
phonetic elements, failure to scrutinize in detail”. Lucy Fildes con- 
cludes that failure is due to specific defect in the auditory or visual 
field. She, as well as Bronner, finds difficulty in the perception and 
memory of visual forms other than words. Gates’ work, on the other 
hand, does not support this latter finding. We should be inclined to 
hold that if the non-readers had no especial difficulty in the perception 
and memory of visual forms other than those of words, this need not 
argue for a special defect in word perception, since it may simply point 
to emotional difficulties associated with the learning of words. 

In her experimental study of poor spellers, Hollingsworth found 
three cases of marked inability to improve with special instruction. She 
believed this was due to specific defect in forming one or more of the 
six bonds necessary for spelling—visual, auditory, motor, hand-voice, 
meaning, and sequence.

It should be noted, however, that one of her difficult cases was a 
stammerer, and that of the other two failures where intelligence was 
normal, one was very “bossy” and the other was very self-conscious 
in regard to her non-spelling. More careful analysis of the early training
and home environment might show the real causes to be in some
early difficulties which the child avoided or to which he did not make 
adequate response, with a resulting failure to integrate and form the 
necessary bonds. “Bossiness” and a sense of inferiority are both fre- 
quently symptomatic of a nervous temperament. 

In our own cases we have frequently observed special difficulties in 
the auditory-vocal memory span for both numbers and letters and in 
the pronunciation of certain sounds, notably “l” and “r”. Further, a
number of our non-readers were described by their teachers as speaking 
(and singing) in monotones, as unable to carry a tune, and finally to 
have special difficulty with vowel sounds. Some of these children could
not build “word families” because of their inability to rhyme. They 
gave as rhymes such words as “mend” and “tin”. The avoidance of 
phonetic methods by a number of investigators may tacitly argue for 
special if unanalyzed auditory and vocal difficulties. So Thomas early
noted that the phonetic method was usually unsuccessful with non- 
readers, and advocated the tracing of words, since, to quote him, “It is 
possible that the earliest memories of letters are muscular.” So Grace 
Fernald achieved success with her most difficult cases by a similar 
kinaesthetic method of tracing and writing words to reinforce the visual 
and auditory coördinations, and Freeman “dropped drills in pronunciation
and word building with such elements as ‘pl’ and ‘wh’, etc., as
often causing difficulty by developing a habit of attack by such minute 
elements; and tried to develop direct association between the sight of 
words and their meaning, by flash cards, short sentence reading, etc.” 

Contrariwise, in other cases, notably as has just been mentioned, 
those of Hinshelwood, the sounding of words and spelling them out 
seemed preferable to the “look and say” method. Hollingworth in her
recent book on Special Talents and Defects describes a boy with extreme
disability whom she finally taught to read by this same method of spelling 
out the letters and sounding them as he went along. Finally we have 
had cases for whom the kinaesthetic method, so well developed by Dr. 
Fernald, did not appear to be the easiest avenue of approach. This 
was recently the case with a new subject who was left-handed. 

The presence of left-handedness may itself point to a motor difficulty. 
This possible relationship between left-handedness and non-readers has 
been noted by others as it has also been in relation to stammerers. The 
proportion of left-handed in our cases—our last two cases were left- 
handed, making the proportion out of the total over one-third—is too 
great not to have significance. The difficulty would, of course, seem to 
be not the mere left-handedness, but that parents attempt to interfere 
when it is too late; namely, at school entrance. If they had wished the
child to be right-handed, they should have given some attention to the 
matter in early infancy. 

This cursory review of some of the possible perceptual and motor 
difficulties and of the methods employed with non-readers may leave one
with the impression that after all it is not a question so much of special
defect as of the lack of individual attention and failure to meet diffi- 
culties squarely as they arise. This is certainly true in some cases, and 
I have already called attention to the fact that in practically all of 
our own cases evidence of a more general lack of stability and of 
faultiness in early training is present. Yet it appears to me that some 
avenues of approach or methods of training are better suited depending 
on individual differences in perceptual and motor habits, and that these 
differences are worth looking for. 

It would take much too long to describe and illustrate adequately 
the faults in the early training and the more general difficulties which 
we believe are perhaps primarily responsible for the special disabilities 
in question. We are planning to publish in the Harvard Monographs 
a description by Miss Elizabeth Lord of the above-mentioned case, 
originally thought to be one of congenital word-blindness, and in a 
second Monograph by Miss Elizabeth Hincks, under the title Reading
Disability in Relation to Neurosis, the descriptions of a number of the 
other cases. I will cite briefly some of the special difficulties in the 
first case and the conclusions of the latter study. 

John was 12 years old and in the fourth grade of a public school 
when we first became interested in him in January, 1918. There is 
nothing of especial significance in the family or personal history except 
for a head injury at the age of 5. An X-ray examination at the age of 
12 gave no indication of earlier trauma. According to the statement 
of the mother, John knew his letters before this injury. Two psycho- 
logical examinations had been given to him previous to our acquaintance 
with the boy. The first when he was 9 years, 9 months old gave a 
mental age of 8 3/5 years, I.Q. 88. He did not then know his letters.
The second examination at 11 years, 10 months gave a mental age of 
12 years, 6 months, or an I.Q. of 105. He did well in all tests except 
those involving reading; and the examiner noted that he could then 
not read “some of the simplest one-syllable words tho passing one of 
the fourteen-year tests”.
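
The two quotients reported above can be checked against the usual ratio definition of the I.Q., that is, mental age divided by chronological age and multiplied by 100. The sketch below is illustrative only: the text does not state how the examiners computed or rounded the quotient, and the function name `ratio_iq` and the conversion of ages to months are assumptions made here.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio I.Q.: 100 * mental age / chronological age (assumed definition)."""
    return 100 * mental_age_months / chronological_age_months


# First examination: chronological age 9 yr 9 mo, mental age 8 3/5 yr.
first = ratio_iq(8.6 * 12, 9 * 12 + 9)

# Second examination: chronological age 11 yr 10 mo, mental age 12 yr 6 mo.
second = ratio_iq(12 * 12 + 6, 11 * 12 + 10)

print(f"first examination:  {first:.1f}")
print(f"second examination: {second:.1f}")
```

Under this assumption both quotients fall within a point of the printed values of 88 and 105.
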

An examination of his eyes by a capable oculist showed a high
degree of muscular disturbance in distant vision, but no apparent eye- 
strain in near vision, and in the opinion of the oculist the condition of 
his eyes had no bearing on his reading disability. 

The family conditions were fairly good. John took a normal part 
in the family life and work and in play with his brothers and sisters 
and other children of his age in the neighborhood. Both his father and 
mother had tried to teach him to read and were much concerned over
his difficulty. He had spent two years in each grade and at the age 
of 12 was repeating the fourth grade. According to his teacher he 
did fairly well in arithmetic and writing and in geography when in- 
structed orally. His spelling was so poor that his papers were not 
corrected, and his reading so inadequate that he was no longer called 
on to recite. He was, however, to be promoted to the fifth grade altho 
really not up to grade in any subject. 

His special defect in reading was substantiated by several tests of 
reading: He was asked to read a short paragraph made up of the 
vocabulary of the story of the Little Red Hen (Progressive Road to 
Reading). It was composed of 211 words, 112 of them being different 
words. It took John ten minutes to read the passage, 43 words were read 
correctly, 39 words incorrectly, 20 words both correctly and incorrectly, 
5 words were told to him, and 5 were omitted. Any passage beginning 
with a capital “O”, as this one did, was apt to be read as “Once upon 
a time”. The character of his mistakes can be gathered from the first 
line or two. The passage began “Once upon a time, long, long ago, a 
little red hen lived by herself in a wee brown house.” He read as fol-
lows: “Once upon a time........ a little red hen lively but had in a
field, but hearse........”
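
The tally for this first test can be checked by simple addition. The five categories sum to 112, which suggests that the count was kept over the 112 different words rather than over the 211 running words of the passage; this reading is an inference from the figures, not something the text states. The dictionary keys below are labels chosen here for illustration.

```python
# Counts for the Little Red Hen passage, as printed in the text.
tally = {
    "read correctly": 43,
    "read incorrectly": 39,
    "read both correctly and incorrectly": 20,
    "told to him": 5,
    "omitted": 5,
}

# Total number of words accounted for by the five categories.
different_words = sum(tally.values())

# Share of those words that were read correctly on every occurrence.
always_correct_pct = 100 * tally["read correctly"] / different_words

print(different_words, round(always_correct_pct, 1))
```
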

As a second test he was tried out on the vocabulary of Book I 
of the Progressive Road to Reading, each word being presented sepa- 
rately. Out of 267 different words, only 22 were given correctly, and 
of those, 6 gave variable results. Many of these words he had studied 
in reading lessons. 

Samples of his spelling are as follows: “Chout Clows” for “Santa 
Claus”; “sotuling” for “stocking”; “fireder” for “fireplace”. “Chonxe”,
“choncer”, “mondnt”, “ptnolie”, “shont”, “caredam” stand for common 
words, the first letter of which usually represented correctly the initial 
sound. 

A comparison of the reading, vocabulary, and spelling tests shows 
that when the context and initial letter did not furnish a clue, the sub- 
ject could not master even the simplest word; and, since words once 
read correctly might later be misread or confused with other words, 
the examiner could not be sure that any words had really been satis- 
factorily mastered. 


PEDAGOGICAL METHODS 


The first thirty lessons, of twenty to fifty minutes daily and ex- 
tending over about a month’s time, were devoted to the trying out of 
methods to determine which might best aid the boy in learning to 
recognize words as ideograms and in the ability to read unfamiliar words. 

Following the suggestion of Hinshelwood, the first trials were made
by spelling out the words. A few words were assigned daily. They 
were usually learned in so far as they could be written and spelled 
correctly, but might not then be recognized in printed form. “The 
sound of the letters as they were spelled from the book might recall 
the word which he had learned to spell and write, but the sight of the 
word without this aid to memory was not sufficient. For instance, long 
after ‘were’ and ‘was’ were correctly spelled and written, ‘were’ was 
frequently called ‘was’ when presented in printed form, etc., etc. 

“Words containing the same letters, if given together for a lesson 
at home, were confused for days. The word learned first was written 
first. Example, ‘Want’, ‘Went’, written on the slip of paper. The 
next day the subject was asked to write ‘went’ first but wrote ‘want’ 
and persisted several times in making the same mistake although he 
would later recognize his error. This mistake or a tendency to make 
this mistake would persist for days. The method of learning words by 
himself was continued because the words when finally learned were 
helpful in other studies, and it employed otherwise idle moments.” 

One interesting finding is reported in connection with this method.
A deficiency in the auditory-memory span has been observed in a goodly 
proportion of our cases of non-readers. A similar difficulty appeared 
in this case in spelling aloud words of five or more letters. 

“The repetition of five or more letters gave so much difficulty and 
the apparent lack of transfer from the ability to spell to the ability to 
read the word, suggested that it was a waste of time to continue with 
this method. For instance, about 50 per cent of the words of five letters 
were correctly spelled, after hearing the correct spelling once, but the
correct spelling of a word of six letters had to be given five to eleven
times before it was correctly repeated. A short time each day was 
given to this method for about a month. About three weeks later at 
the end of the total period of instruction the auditory-memory was again 
tested in this way. The five-letter words were spelled correctly after 
hearing them spelled once and 66 per cent of the six-letter words were 
also spelled correctly. There seems to have been a definite increase in 
memory span during the two months of drill.” 

The second method experimented with was that of reinforcing the 
visual memory of the word by drill in writing, or in other words, by 
kinaesthetic sensations. The method was in general that suggested by 
Thomas and others and, later, so successfully employed by Dr. Grace 
Fernald. In this case, however, it was found that “the ability to write 
a word did not seem to aid the subject in reading the words in print”,
and the method was therefore abandoned. This result may be due to 
the fact that this method did not happen to fit the particular difficulties 
of this case, or that the method was not pursued for a long enough time. 
Dr. Fernald has observed that in some cases months of drill are neces- 
sary before marked improvement is secured. 

The third method tried was a modification of the “look and say” 
or word-whole method in which the recognition of the word is sought 
as a direct response to the visual stimulus. To quote again from Miss 
Lord’s Monograph: 

“From the results of the vocabulary test given in the special reading
examination and from incidental remarks made by the subject, it was 
decided that words were not confused on account of the similarity of 
shape, but because they had a number of the same letters. [For example,
‘my’ was called ‘may’. In explanation of his mistake the boy said that 
he didn’t look for the ‘a’. When he called ‘were’ ‘was’, he said that 
he just looked at the ‘w’.] This type of mistake suggested that if, 
beside the initial letter another letter or letters of phonetic importance 
were brought into evidence by printing them in red, they would become 
the distinguishing signs so that the word could be recognized at sight. 
When two or more words were confused in a lesson these words were 
presented to the subject the next day with their distinguishing letters
printed in red. For example, ‘look’, ‘lock’; ‘like’, ‘lake’. These words
were reviewed for several days and then the subject was tested on the 
same words without any red letters. 

“There were numerous examples to show that these arbitrarily 
chosen letters were remembered as the distinguishing sign in the recog- 
nition of the word. The subject said: ‘After I get used to them I'll 
know them without the red letters’, and would often suggest that the 
words be printed for him suggesting the letters he wished to have in 
red.” 

The last method used was that of phonetic drill. This method was 
carried out until “new words were sounded out correctly although many 
attempts were often made before the correct word was given”. At 
the end of about three months he was able to read a page of Robinson 
Crusoe containing 296 words of which 19 had not been seen in any 
previous lesson. This page was read in five minutes. If he made a 
mistake his attention was called to the fact and he corrected his 
error without assistance. 


PSYCHOLOGICAL ANALYSIS OF DIFFICULTIES 


The analysis of the difficulties which this subject presented was
made by observation of the character of mistakes, the nature of the 
procedure involved, the subject’s spontaneously reported introspections, 
by the use of the short exposure apparatus, etc. First among the pos- 
sible causes of this boy’s difficulties in learning to read would seem to be 
the limitation in his visual and auditory memory spans. His visual 
memory span for both letters and numbers, as tested by the short ex- 
posure apparatus, was limited to a series of three letters or of three 
numbers. In the tests made at the very end of his training, he was 
unable to read four letters or numbers correctly. Familiar words of 
three letters were usually given correctly. The recognition of these 
words within the interval of the exposure may usually be taken as evi- 
dence that the words are read as wholes and not by means of their 
separate elements, altho, of course, some single “dominating” or arbi-
trarily fixed element may still be used as the clue for the word’s recog-
nition. The subject’s introspections sometimes, however, gave support
to the first interpretation, that is, that the word was recognized as a 
whole. If the word had only three letters he often added, “I can see
it all at once”, or “I can look at three letters together”. Such assur-
ance seldom followed the observation of four or more letters.

Evidence in regard to the auditory memory span for letters has 
already been given in the discussion of the pedagogical methods em-
ployed. Words of five letters such as “table”, “plant”, “sieve” had to
be spelled out by the teacher three or four times before the boy could 
repeat them correctly after the teacher. Words of six letters such as 
“radish”, “flower”, “family” had to be spelled out ten or eleven times 
before the boy could repeat the sequence of letters correctly. 

A second factor at the basis of his difficulties was the inaccuracy 
of his perceptions, as can be judged by the following: Altho all the 
letters of the alphabet could be recognized when presented separately, 
in actual reading, “b” and “p”, “m” and “n”, “d” and “b” were fre-
quently confused. “Nest”, for example, was read “most”. In general
the elements of the word form determining its perception are insuffi- 
cient. The shape of the word and the initial letters were usually the 
chief determining factors. The filling in of the further details, which 
were evidently necessary for correct recognition, was a slow process. 
For example, in fixing in mind the word “house”, the boy remarked, “It 
looks just like ‘horse’, but there isn’t any ‘r’ in it.” “Mouse”, he said, 
differed from “mice” because there wasn’t any “i” in it. 

A word once mastered or occurring in the preceding paragraph or 
sentence is likely to be read in the place of similar words. His diffi- 
culties in breaking with an old association, and taking on a new one, 
were at times pathetic. For “on his hands and knees” the boy had read 
“on his hands and keys”, then he corrected himself as follows: “No, 
knees. I say ‘keys’ and look back and make believe there’s no ‘k’ on it, 
because I know he didn’t have any keys.” 

The extent to which one habit of word recognition interfered with 
the forming of another, as is required in the recognition of new but 
similar words and phrases or of changes in the word patterns, may be 
considered as a third factor in the analysis of the boy’s difficulties. The 
words “making rabbit wool” were read “making rabbit holes”. The 
words “rabbit holes” had been associated in a previously read story. 
Frequent corrections in two or three successive lessons were required 
before the word “wool” could be attached to “rabbit”. The difficulty 
in other cases was due to the persistence of what appears to be an 
incorrect motor habit, such as is acquired in the enunciation of words. 
The difficulty is also in part auditory. The following example will illus- 
trate the matter: The word “few” was on first reading called “flew”. 
It took repeated corrections and suggestions thru five successive periods 
before, as the boy said, he could keep the “l” out, and even then it
was necessary to recall the “total pattern” in which the word “few” 
had been fixed before he could trust himself to say the word. (E.g. 
“I was going to say ‘flew’, but I said ‘a few papers’.”)


RESULTS 


The results of about three months of intensive drill may be ap-
praised by comparing the boy’s attainments on the tests which were
given at the beginning and at the end of his training. In March, a
vocabulary test of 510 words from Book I of the Progressive Road to
Reading resulted in correct responses in the cases of 149 words, or in
29 per cent of the list. In June, 370 words were read correctly, or 74
per cent. In the simplified version of the story of “The Little Red Hen” 
in which the words were presented in their context, there was a total 
of 211 words, of which 138 (or 65 per cent) were read correctly in March 
and 203 (or 97 per cent) in June. 
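
The percentages in this comparison can be recomputed from the printed counts. The sketch below is a check under stated assumptions, not a reconstruction of how the figures were originally worked out; the labels and the function name `percent_correct` are chosen here for illustration. The straight quotients for the two June results come to roughly 73 and 96 per cent, a point or so below the printed 74 and 97, which may reflect the rounding practice used or a misprint in one of the counts; the sketch does not attempt to settle which.

```python
# (words read correctly, words in the test), as printed in the text.
scores = {
    ("vocabulary list", "March"): (149, 510),
    ("vocabulary list", "June"): (370, 510),
    ("Little Red Hen", "March"): (138, 211),
    ("Little Red Hen", "June"): (203, 211),
}


def percent_correct(correct: int, total: int) -> float:
    """Percentage of the words in a test that were read correctly."""
    return 100 * correct / total


percents = {key: percent_correct(*pair) for key, pair in scores.items()}

for (test, month), pct in percents.items():
    print(f"{test:16s} {month:6s} {pct:5.1f} per cent")
```
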

Out of 296 words in the final reading lesson on June 11, the follow- 
ing were correctly pronounced after only slight hesitation. “Heavy”, 
“search”, “cave”, “finish”, “proud”, “strong”, “high”, “firm”, “enemy”, 
“comfortable”, “safe”. “Spot” was first called “stop” but then given 
correctly. Similar mistakes were the calling of “with”, “which”; “fences” 
“feences”; “so” “soon”; “ladder” “larder”; “no” “on”; “pulled” 
“ploughed”; “slept” “slid”. All of these mistakes were, however, imme- 
diately corrected. 

Miss Lord’s conclusions at the time were, “that the boy had a fairly 
rational approach to reading; he had a basic number of words and 
sounds which he could recognize at sight and sufficient ability in the 
use of phonetics to sound out many unfamiliar words. It might be 
said that he was able to read, that he had the necessary technique, 
but it was my opinion that he would never read for pleasure as reading 
for him was such a slow laborious process.” 

This prognosis which seemed likely at the time has proved, after a 
lapse of years, to have been in error, and the subsequent findings have 
increased our interest in the case. In January, 1923, five years after 
our first examination of the boy, inquiry was made in regard to him 
with the following results: 

“John is now in a trade school. He reached the eighth grade when 
16 years old, having been promoted each year. At the suggestion of his
principal he has started a three-year course ‘to learn all about an auto- 
mobile, he’s doing fine. Gets high marks, very good, in history.’ The 
year before when Mrs. R. asked the principal about John’s reading, she 
found that his teacher at that time did not know he had ever had any 
difficulty in reading. During the last years every teacher had said that 
John gave her no trouble. His last teacher said that John was the best 
boy in her room; she could always trust him. ‘He’s crazy about reading 
...He reads all the time...He always has books out of the Public 
Library. He’s reading this one now, Desert Island by Jackson Gregory,
published by Scribner’s. Five other stories were then shown, one of 
them by Henry. He went to a sale and bought six books with the money 
he had saved.” 

In April, 1923, he was re-examined by the oculist. There was 
practically no change since June, 1918. 

The outcome has caused some revising of our conceptions of “word- 
blind” children. How are we to account for it? The boy had been a 
failure in school; in six or seven years he had scarcely mastered with 
certainty a single word with the possible exception of his own name. 
There were no intellectual defects discoverable except deficiencies in the 
auditory-vocal and visual-vocal span for numbers and letters, and the 
difficulties above analyzed in the recognition and memory of words, and
such difficulties in school subjects as would naturally result from these 
special disabilities. 

His failure to form the necessary associations between the visual, 
auditory, and kinaesthetic stimuli involved in letter and word recog- 
nition and the appropriate vocal response may argue for some special 
congenital or acquired defects in association (the evidence that he knew 
his letters before he started in school and that his special difficulties 
followed a severe head injury at about the time of his entrance to school 
would point to an acquired defect). On the other hand, the sensori- 
motor defects indicated in the auditory-memory and visual-memory spans 
for letters and numbers (and quite possibly also the condition of his 
eyes) may have made the establishment of the appropriate associations 
a little more difficult than usual. Their establishment possibly required 
more than the usual drill and required, perhaps, at the start the indi-
vidual attention which was subsequently necessary before he learned to
read. The failure to form these initial associations would render suc- 
ceeding instruction useless. The present writer inclines to the latter 
interpretation and favors the following statement of Miss Lord’s: 

“John should be considered a beginner in reading, as, having failed
to overcome the initial difficulty of learning a basic number of words 
and the letters essential for the phonetic training, all subsequent train- 
ing was completely beyond his comprehension and he probably developed 
a defensive attitude with inhibitions toward the reading lesson.” 

A certain feeling of inadequacy had unquestionably been built up in
this case as a result of the continual failures in school. It is indicated 
in a remark which the boy made to his mother, that he had never
been understood before he received this special individual instruction. 
His troublesomeness and poor conduct in school was also doubtless a 
resultant of this feeling. In the cases assembled by Miss Hincks in the 
second Monograph, to which reference has been made above, the presence 
of such psychopathic tendencies has been more specially inquired into. 
The following brief quotation from the conclusions of this latter Mono- 
graph will serve to summarize this phase of our discussion of these cases. 

“Although our subjects were chosen for their reading difficulty 
alone, upon investigation their histories exhibited nervous traits, either 
in their own behavior or in that of their immediate families. Except 
for Florence and Andrew our children all had cultivated parents, many
of whom were highly distinguished for intellectual work. In several 
cases the homes lacked harmony and wise discipline. Our subjects 
showed such reactions as nightmares, hesitation in speech, anxiety about 
health, willfulness, rudeness to elders, stubbornness, untruthfulness, over-
sensitiveness, boastfulness, selfishness, repression, high fatiguability, and
emotional outbreaks of crying and tantrums. 

“With a great deal of individual help and effort the reading dis- 
ability has been greatly lessened, and in some cases almost removed. 
Where the nervousness was not very severe the learning to read when 
it was finally achieved, was accompanied by great improvement in be-
havior. In one case where the neurotic trend was deeply imbedded in
the family life improvement in reading has not as yet affected the gen-
eral symptoms.

“We conclude that there are certain perceptual irregularities which 
may combine to make learning to read more difficult for the child 
possessing them, than for the average, and more difficult than other 
kinds of learning for the same child, although the difficulty is by no 
means unsurmountable, and should not be designated ‘word-blindness’.
When these irregularities are transmitted in stock which is nervous,
highly organized, and unstable, there is not usually found the patience 
and persistence to overcome them. Where they are found, as in our 
cases, in families of intellectual achievement where academic facility 
is expected, there is an added stigma and distress to the mind of the 
unfortunate child. The intellectual difficulty is increased by the emo- 
tional traits of the child, which are in turn augmented by the dis- 
approval and worry of excitable parents. An inferiority complex is 
built up by an unfavorable interaction in which the reading difficulty 
adds to the general nervousness of the child’s condition, and this condi- 
tion adds further inhibition to the learning processes, and irritation to 
the parents, so that a general family and school maladjustment occurs.” 











